Alright Encoding, Let’s Do It

My former colleagues recently ran into an issue with a Rails 3.1 application: when they upgraded to the latest versions of several gems, text stored in a serialized field suddenly started showing the byte codes for curly quotes, e.g. “I don’t” suddenly turned into “I donâ\u0080\u0099t”.

Let’s pause for station identification (here, watch this duel for some cinematic flavor) and write up a few terms for Google to find to save others this headache: Problem. Encoding. YAML. Serialized. Rails. Delayed Job. Upgrade. Syck. Psych. Characters look funny. Display Issues. Latin1 is the root of all evil. UNHOLY TEXT CRAPTASM.

They resolved it with some phpMyAdmin text field editing. But I thought I had beaten down this encoding mess once and for all with a great big UTF-8 MySQL push years ago, and by heading into the promised land that was Ruby 1.9 with regard to string handling. So I wanted to know the root cause.

What went wrong?

I had this YAML file of stock questions that I used to seed the database. Unfortunately, I paid no attention to what was actually being stored in the database.

Let’s observe – I can’t paste the “right single quotation mark” (which the Mac OS X character viewer gleefully reports as Unicode: U+2019, UTF-8: E2 80 99) into IRB, but I can cheat:

% echo "I’m your huckleberry." > test.yml
% rails console
Loading development environment (Rails 3.1.3)
>> string = YAML.load(File.open('test.yml'))
=> "I’m your huckleberry."

And when we convert that to YAML, as Rails does by default when serializing it:

>> string.to_yaml
=> "--- \"I\\xE2\\x80\\x99m your huckleberry.\"\n"

Doh! But then de-yamling it seems okay:

>> newstring = YAML.load(string.to_yaml)
=> "I’m your huckleberry."

Which is why I never noticed. I mean, how often do you look at a man’s shoes? er. I mean, in the database. Sorry, mixing the movie metaphors.

But that only held until the gem upgrade – which we’ll simulate here with a hint of foreshadowing:

>> yamlstring = string.to_yaml
=> "--- \"I\\xE2\\x80\\x99m your huckleberry.\"\n"
>> YAML::ENGINE.yamler = 'psych'
=> "psych"
>> newstring = YAML.load(yamlstring)
=> "Iâ\u0080\u0099m your huckleberry."

Doh! And all I wanted was a normal encoding-free life.
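For anyone who wants to poke at this without a 1.9-era console, here is a minimal sketch that should run on a current Ruby, where Psych is the only YAML engine. As far as I can tell from the YAML spec, `\xNN` inside a double-quoted scalar is an escape for an 8-bit Unicode code point, not a raw byte, so Psych turns syck's three escaped bytes into three separate characters. The constant and variable names are mine, and the YAML text is hand-typed to match what syck produced above.

```ruby
require 'yaml'

# What syck wrote: each UTF-8 byte of U+2019 as its own \xNN escape.
# %q keeps the backslashes literal, exactly as they would sit in the database.
SYCK_STYLE_YAML = %q(--- "I\xE2\x80\x99m your huckleberry.") + "\n"

# Psych reads every \xNN as a code point (U+00E2, U+0080, U+0099).
misread = YAML.load(SYCK_STYLE_YAML)

# The same damage produced by hand: take the UTF-8 bytes of the real string
# and pretend each byte is a Latin-1 character.
by_hand = "I\u2019m your huckleberry.".b.force_encoding('ISO-8859-1').encode('UTF-8')

# And the reverse: squeeze the code points back into single bytes,
# then relabel those bytes as UTF-8.
repaired = misread.encode('ISO-8859-1').force_encoding('UTF-8')

puts misread.inspect
puts misread == by_hand
puts repaired
```

The reverse trick only works when every character in the damaged string fits in Latin-1; anything above U+00FF makes `encode` raise, so treat it as a diagnostic rather than a bulk fix.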

So after observing the problem in its native form, I turn to Google – which turns up this Stack Overflow post – and yep:

% rails console
Loading development environment (Rails 3.1.3)
>> YAML::ENGINE.yamler
=> "syck"

We have the culprit! But not where it’s coming from.

At first, I blame Rails, because that’s usually the easiest thing to do, right? Surely they changed something between 3.1 and 3.2? But searching the source code and grepping the log indicate that Rails got some psych tenderlove a long time ago.

% git log | grep 'psych'
c29eef7 [1 year, 2 months ago] (Aaron Patterson) load psych by default if possible
59f3218 [1 year, 2 months ago] (Aaron Patterson) load and prefer psych as the YAML parser when it is available

So then I do a grep on the gems:

% grep -ir 'syck' .
[...]
./delayed_job-2.1.4/lib/delayed/yaml_ext.rb:YAML::ENGINE.yamler = "syck" if defined?(YAML::ENGINE)

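And there we have it, and here’s why (note Aaron Patterson’s prophetic warning) – Delayed Job 2.1.4 forced ‘syck’ on the whole application, Delayed Job 3 doesn’t anymore, and so after the upgrade the app fell back to ‘psych’, which reads the old escaped bytes as separate characters. Freshly written data is fine: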

% rails console
Loading development environment (Rails 3.2.2)
>> YAML::ENGINE.yamler
=> "psych"
>> string = YAML.load(File.open('test.yml'))
=> "I’m your huckleberry."
>> string.to_yaml
=> "--- I’m your huckleberry.\n...\n"
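If you want a tripwire for this, a check along these lines could live in a test suite. It is only a sketch, written against a current Ruby where Psych is the sole engine: it dumps a string containing U+2019 and confirms that no `\xNN` byte escapes show up and that the value survives the round trip. The variable names are mine.

```ruby
require 'yaml'

original = "I\u2019m your huckleberry."
dumped = original.to_yaml

# syck-style output would contain literal backslash-x escapes.
escaped = dumped.match?(/\\x\h\h/)

# A healthy engine gives back exactly what went in.
round_trips = YAML.load(dumped) == original

puts "escaped bytes in dump: #{escaped}"
puts "round trips: #{round_trips}"
```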

There’s still the issue of cleaning up the old data, and while it’s a little late for my colleagues, an easy fix (though you may want to turn off timestamping) for our serialized fields (at least for the stock questions) could have been:

>> YAML::ENGINE.yamler = 'syck'
>> all_responses = {}
>> StockQuestion.all.map{|sq| all_responses[sq.id] = sq.responses}
>> YAML::ENGINE.yamler = 'psych'
>> StockQuestion.all.each do |sq|
>>   sq.responses = all_responses[sq.id]
>>   sq.save!
>> end
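If syck is no longer around to re-read the old rows (it is gone from current Rubies), another route is to repair the raw YAML text before Psych ever parses it. The sketch below is an assumption-laden alternative, not what my colleagues did: `unescape_syck_bytes` is a hypothetical helper that finds runs of high-byte `\xNN` escapes and swaps in the UTF-8 characters they spell out, leaving any run that isn't valid UTF-8 untouched. It does not try to handle a literal escaped backslash followed by `xNN`, so eyeball a sample of your data first.

```ruby
require 'yaml'

# A run of syck-style escapes for bytes >= 0x80, e.g. \xE2\x80\x99.
# ASCII-range escapes are skipped on purpose so quoting is never disturbed.
HIGH_BYTE_RUN = /(?:\\x[89A-Fa-f]\h)+/

# Hypothetical helper: rewrite escaped UTF-8 byte runs in raw YAML text
# as the real characters; keep the run as-is if the bytes are not valid UTF-8.
def unescape_syck_bytes(raw_yaml)
  raw_yaml.gsub(HIGH_BYTE_RUN) do |run|
    bytes = run.scan(/\\x(\h\h)/).flatten.map(&:hex)
    decoded = bytes.pack('C*').force_encoding('UTF-8')
    decoded.valid_encoding? ? decoded : run
  end
end

# The raw column value as syck left it (hand-typed to match the session above).
old_column = %q(--- "I\xE2\x80\x99m your huckleberry.") + "\n"

fixed_column = unescape_syck_bytes(old_column)
puts fixed_column
puts YAML.load(fixed_column)
```

In a real cleanup you would read and write the raw column text rather than the deserialized attribute, and try it on a copy of the table first.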

I’m sure I’ll meet up with encoding again. Then we’ll have us another reckoning.

p.s. syck must have the most unique dual license ever