Alright Encoding, Let’s Do It

My former colleagues recently ran into an issue with a Rails 3.1 application: after they upgraded several gems to their latest versions, text stored in a serialized field suddenly started showing escaped byte codes in place of curly quotes, e.g. I don’t suddenly turned into I donâ\u0080\u0099t

Let’s pause for station identification (here, watch this duel for some cinematic flavor) and write up a few terms for Google to find to save others this headache: Problem. Encoding. YAML. Serialized. Rails. Delayed Job. Upgrade. Syck. Psych. Characters look funny. Display Issues. Latin1 is the root of all evil. UNHOLY TEXT CRAPTASM.

They resolved it with some phpMyAdmin text field editing. But I thought I had beaten down this encoding mess once and for all with a great big UTF-8 MySQL push years ago, and by heading into the promised land that was Ruby 1.9 with regard to string handling. So I wanted to know the root cause.

What went wrong?

I had this yaml file of stock questions that I used to seed the database. Unfortunately I paid no attention to what was actually being stored in the database.

Let’s observe – I can’t paste the “right single quotation mark” (which the Mac OS X character viewer gleefully reports as Unicode: U+2019, UTF-8: E2 80 99) into IRB, but I can cheat:
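One way to cheat, as a minimal sketch: build the character from its code point, or from the raw UTF-8 bytes the character viewer reports, instead of pasting it. The variable names here are mine, not from the original session.

```ruby
# Build U+2019 (right single quotation mark) without pasting it.
from_codepoint = [0x2019].pack("U")                          # from the Unicode code point
from_bytes     = "\xE2\x80\x99".dup.force_encoding("UTF-8")  # from the raw UTF-8 bytes

text = "I don" + from_codepoint + "t"

puts text.encoding
# Show the bytes in hex to compare against what the character viewer says.
puts text.bytes.map { |b| format("%02X", b) }.join(" ")
```

Both routes should give the same three-byte string, so either one works as a stand-in for the pasted quote.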

And when we convert that to yaml as Rails does when serializing it (by default):
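Syck no longer ships with Ruby, so the old output cannot be reproduced directly on a current install. The sketch below assumes the stored form looked like the `syck_style` literal (each UTF-8 byte written as its own `\x` escape) and prints it next to what the current YAML engine writes for the same string.

```ruby
require "yaml"

text = "I don\u2019t"

# Assumption: the shape of what Syck stored, one \x escape per UTF-8 byte.
# Written as a literal because Syck is not available on modern Rubies.
syck_style = "--- \"I don\\xE2\\x80\\x99t\"\n"

# What the current engine (Psych on modern Rubies) writes for the same string.
current = text.to_yaml

puts syck_style
puts current
```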

Doh! But then de-yamling it seems okay:

Which is why I never noticed. I mean, how often do you look at a man’s shoes? er. I mean, in the database. Sorry, mixing the movie metaphors.

But that only lasted until the gem upgrade – which we’ll simulate here with a hint of foreshadowing:
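A rough way to simulate the upgrade without Syck installed: take a document in the escaped form (again an assumed literal, not output captured from the original app) and load it with Psych. Psych reads each `\xNN` escape as a code point rather than a byte, which is where the funny characters come from.

```ruby
require "yaml"

# Assumption: a Syck-era document, with the three UTF-8 bytes of U+2019
# escaped one by one.
stored = "--- \"I don\\xE2\\x80\\x99t\"\n"

# Psych treats each \xNN escape as a code point, not as a raw byte.
loaded = YAML.load(stored)

puts loaded.inspect
puts loaded.codepoints.map { |c| format("U+%04X", c) }.join(" ")
```

The loaded string contains U+00E2, U+0080 and U+0099 as three separate characters instead of the single U+2019.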

Doh! And all I wanted was a normal encoding-free life.

So after observing the problem in its native form, I turn to google – which turns up this stackoverflow post – and yep:
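The check looks roughly like this. `YAML::ENGINE.yamler` is the switch Ruby 1.9 exposed for choosing between Syck and Psych; the `defined?` guard is my addition so the snippet also runs on later Rubies, where that constant is gone and Psych is the only engine.

```ruby
require "yaml"

# On Ruby 1.9-era installs YAML::ENGINE.yamler names the active engine.
# Assumption: where the constant is missing, Psych is the engine in use.
engine = defined?(YAML::ENGINE) ? YAML::ENGINE.yamler : "psych"

puts engine
```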

We have the culprit! But not where it’s coming from.

At first, I blame Rails, because that’s usually the easiest thing to do, right? Surely they changed something between 3.1 and 3.2? But searching the source code and grepping the log indicate that Rails got some psych tenderlove a long time ago.

So then I do a grep on the gems:

And there we have it and here’s why (Note Aaron Patterson’s prophetic warning) – Delayed Job 3 doesn’t force ‘syck’ anymore, so it fell back to ‘psych’.

There’s still the issue of cleaning up the old data, and while it’s a little late for my colleagues, an easy fix (though you may want to turn off timestamping) for our serialized fields (at least for the stock questions) could have been:
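The exact cleanup isn't reproduced here, so what follows is a different, string-level sketch rather than the original fix: a hypothetical helper, `repair_mojibake`, that turns the stray code points back into bytes and accepts the result only if it is valid UTF-8. In an app it would be run over each affected serialized value before saving the record again.

```ruby
# Hypothetical helper (not the original fix): undo the damage done when
# escaped bytes were read back as code points.
def repair_mojibake(str)
  points = str.codepoints.to_a

  # Leave plain ASCII alone, and skip text with code points too large to be single bytes.
  return str if points.all? { |c| c < 0x80 } || points.any? { |c| c > 0xFF }

  # Treat each code point as one raw byte and reinterpret the bytes as UTF-8.
  candidate = points.pack("C*").force_encoding("UTF-8")

  # Accept the repair only if the bytes form valid UTF-8; otherwise the text
  # probably held genuine Latin-1 range characters and should stay as it is.
  candidate.valid_encoding? ? candidate : str
end

puts repair_mojibake("I don\u00E2\u0080\u0099t")
```

The validity check is a heuristic: a string that legitimately contains a Latin-1 sequence that also happens to be valid UTF-8 would be rewritten, so it is worth eyeballing a sample of the data first.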

I’m sure I’ll meet up with encoding again. Then we’ll have us another reckoning.

p.s. syck must have the most unique dual license ever