XML Entities, How do I hate thee? Let me count the

I hate HTML and XML Character Encoding, what a nightmare.

This all started yesterday. Actually, this all started several days ago with a post to our Campus “Network Administrators Group” mailing list – in which I opined on the benefits of RSS and RSS aggregators and queried for the group’s favorites.

Well, a couple of people pointed out that Thunderbird does RSS – à la an old-fashioned newsreader. So I wanted to try it out.

Only the current release of Thunderbird doesn’t natively support importing a bunch of subscriptions (using, say, an OPML file). Well, that led me to this blog entry, which adds OPML import/export support to Thunderbird.

Cool, right? Well, some days it doesn’t pay to get out of bed and write your own weblog software.

See, on import of my NetNewsWire OPML output, Thunderbird reported that two of the feeds were invalid.

RSS feeds are XML, and like any XML/XHTML source, they should be valid XML. In practice, however, this can be a total nightmare.

When EWE was released, I checked all the feeds with sample data at feedvalidator.org to make sure things looked okay.

Valid RSS right?

Well, valid until it wasn’t valid 🙂

What broke one of the feeds was the innocuous copyright symbol (©). In the source text, it was the actual copyright character. I have been dutifully running the RSS output through the htmlentities() function to convert things like that to their HTML entity representation (&copy; in this case). XML doesn’t know anything about HTML entities built-in, except for just a few (&amp;, &lt;, &gt;, &quot; and &apos;). I understand that, but you’d think that I could supply some namespace or something and fix that.

Well, I probably can, but I don’t understand that. Trying to understand would likely make me curse a lot (more).
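For the curious, here is a minimal sketch of the failure described above. It assumes PHP 5.4 or later with the SimpleXML extension, a UTF-8 source file, and a made-up sample string and `<title>` wrapper; it is not the actual EWE feed code.

```php
<?php
// Made-up sample text containing a literal copyright character.
$text = "Copyright © 2005";

// htmlentities() turns the character into the HTML named entity "&copy;".
$named = htmlentities($text, ENT_QUOTES, 'UTF-8');

// Collect XML parse errors quietly instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// A bare XML parser has never heard of &copy;, so this parse fails.
$bad = simplexml_load_string("<title>$named</title>");

// A numeric character reference needs no declaration, so this parse works.
$good = simplexml_load_string("<title>Copyright &#169; 2005</title>");

var_dump($bad === false);
var_dump((string) $good);
```

The named entity is only meaningful to a parser that has been told what it stands for, which is why the numeric form is the safer thing to put in a feed.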

So I tried the xmlentities() function that was buried in the comments for htmlentities(). That was nice until it didn’t handle &nbsp;, which isn’t valid in XML: the function relies on PHP’s get_html_translation_table(), which has a limited set of entries in it, none of which is &nbsp;.

So then I found this article – which has 2000+ translations.

So now I run the RSS text through htmlentities() – then through strtr() with the gigantic translation table (with the <, > and &amp; translations commented out).
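Roughly, the two-step approach looks like the sketch below. The three-entry table is a hypothetical stand-in for the 2000+ entry one, and the function name rss_safe() is made up for illustration; a UTF-8 source file is assumed.

```php
<?php
// Hypothetical stand-in for the gigantic table: HTML named entity => numeric reference.
$namedToNumeric = array(
    '&nbsp;'   => '&#160;',
    '&copy;'   => '&#169;',
    '&eacute;' => '&#233;',
    // '&lt;', '&gt;' and '&amp;' are deliberately left out: XML already defines them.
);

// Step 1: htmlentities() produces HTML named entities.
// Step 2: strtr() swaps each named entity in the table for its numeric form.
function rss_safe($text, array $table)
{
    $html = htmlentities($text, ENT_QUOTES, 'UTF-8');
    return strtr($html, $table);
}

echo rss_safe("Café & Co © 2005", $namedToNumeric), "\n";
```

Any named entity missing from the table passes through untouched, which is exactly the "until something else breaks" risk. On PHP 5.4 or later, htmlspecialchars() with the ENT_XML1 flag may be a simpler route, since it escapes only the characters XML itself reserves and leaves UTF-8 text alone.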

I think I’ve fixed things – until something else breaks.

But by then – I think I’ll be like this chap and if:

1) You can type a character on the keyboard;
2) Browsers can display it (they better if (1) is true)
3) Printers can print it
4) Humans can read it

then the RSS feed is valid. This whole valid-invalid BS is making RSS difficult for both reader makers and publishers.

Until then: