Dougal Campbell's geek ramblings

WordPress, web development, and world domination.

Dealing with UTF in node.js

In case you’ve never heard of it, I wrote a little Twitter friend/follower cross-reference tool a few years ago. Basically, I was wondering which of the people I followed on Twitter also followed me back, who didn’t follow back, who followed me that I didn’t follow back, and all the permutations around those ideas. After a couple of days of hacking, Twitual was born.

One of the main problems with Twitual, however, is that the data gathering and analysis is linear. When you submit a Twitter username, the server has to fetch all of the friends, fetch all of the followers, and do all of the set calculations before it can send a single thing back to your browser. Due to the way the Twitter API works, it may have to make many HTTP calls to the Twitter servers. It can be slow, and you’re left with a browser that appears to be frozen, with nothing happening, until it finally updates the page all at once.
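
As a rough illustration of the set calculations involved (a minimal sketch, not Twitual's actual code; the function name `crossReference` and the sample IDs are made up for this example), the cross-referencing boils down to intersections and differences of two ID lists:

```javascript
// Hypothetical helper: cross-reference two lists of user IDs.
// Neither the name nor the data shape comes from Twitual itself.
function crossReference(friendIds, followerIds) {
  const friends = new Set(friendIds);     // people I follow
  const followers = new Set(followerIds); // people who follow me

  return {
    // I follow them and they follow me back
    mutual: friendIds.filter((id) => followers.has(id)),
    // I follow them, but they do not follow me
    notFollowingBack: friendIds.filter((id) => !followers.has(id)),
    // They follow me, but I do not follow them
    fans: followerIds.filter((id) => !friends.has(id)),
  };
}

// Made-up sample data: I follow 1-4, and 3-5 follow me.
const result = crossReference([1, 2, 3, 4], [3, 4, 5]);
console.log(result);
```

None of this can start until both lists have been fetched in full, which is why the linear version has nothing to show the browser until the very end.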

For a long time, I’ve wanted to rewrite Twitual to be more dynamic, and give more friendly feedback while it’s working. So some time ago, I started toying around with rewriting things using nodejs instead of PHP. My early experiments were promising. I used socket.io for real-time client/server updates, found a good Twitter API library for node, and threw together a quick prototype. But after playing with it for a bit, I kept seeing occasional failures where the server side just seemed to stop working, along with a mysterious error in my browser’s JavaScript console.

On the server-side, I did not see any errors. Everything appeared to be working fine. But in the browser, I saw an error stating “Could not decode a text frame as UTF-8”. This error did not come from my code, or any of the libraries I was using. I finally determined that it was from the Chrome browser itself. And when this error occurred, the browser unceremoniously dropped the connection to the server, disconnecting the socket.io link, and from that point, there was no way to recover.

After a little bit of searching, and with some serendipitous remarks heard in the NodeUp podcast, I began to narrow this down to how the V8 JavaScript engine (at the heart of nodejs) handles Unicode. But even knowing what kind of problem I might be dealing with, I still spent hours trying to find out at what level the problem surfaced, and if it was possible for me to do anything about it.

There were multiple places where the problem might actually be triggered: in the nodejs server itself (possibly within V8), in the nodejs JavaScript system libraries (the http module, for example), in the ntwitter library I’m using for Twitter API calls, in the socket.io library, or perhaps some failing in my own code.

Knowing (well, 99% sure) that the problems were probably caused by Emoji (a Japanese character set for graphical emoticons) or other odd Unicode characters coming in from Twitter, my first investigations were in the http system module and the ntwitter module. I thought perhaps that when these characters came into node and the JSON data was transformed from raw strings into JavaScript objects, some untrapped error might occur there. However, after further testing, it seemed that retrieving the Twitter data was not the problem.

My next suspect was the socket.io module. Testing seemed to support this idea: everything went fine until I attempted to send my JSON-encapsulated Twitter data to the browser, at which point Chrome reported the “Could not decode a text frame as UTF-8” message. But after more searching and reading, it seemed more and more like this problem was not specific to socket.io, or any other module.

In all of my searching, I seemed to keep coming back to this one particular message thread on the V8 issues queue about UTF-8 encoding/decoding problems. This seemed to be the crux of the problem, and after digesting what I read there, and experimenting with some code, I saw what was happening.

As explained in that thread, “V8 currently only accepts characters in the BMP as input, using UCS-2 as internal representation (the same representation as JavaScript strings).” Basically, this means that JavaScript uses the UCS-2 character encoding internally. UCS-2 is strictly a 16-bit format, so it can only represent the first 65,536 Unicode code points (the Basic Multilingual Plane, or BMP). Any characters that fall outside that range are apparently truncated in the conversion from UTF-8 to UCS-2, mangling the character stream. In my case (as with many others I found in my research), this surfaces when the system attempts to serialize or deserialize these strings as JSON. In the conversion, you can end up with character sequences which are invalid UTF-8. When browsers see these broken strings come in, they promptly drop the connection mid-stream, apparently as a security measure. (I sort-of understand this, but would have a hard time explaining it, because these character-encoding discussions give me a headache.)
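
To make the 16-bit limitation concrete, here is a small sketch. It assumes a reasonably recent node.js, and the sample character U+1F600 and the variable names are picked purely for illustration. A character outside the BMP shows up in a JavaScript string as two 16-bit code units (a surrogate pair), and either half on its own is not a valid character:

```javascript
// U+1F600 (a grinning-face emoji) lies outside the BMP.
const emoji = "\u{1F600}";

// The string holds two 16-bit code units, not one character.
console.log(emoji.length);                     // 2
console.log(emoji.charCodeAt(0).toString(16)); // d83d (high surrogate)
console.log(emoji.charCodeAt(1).toString(16)); // de00 (low surrogate)

// Encoded as UTF-8, the same character takes four bytes.
console.log(Buffer.byteLength(emoji, "utf8")); // 4

// Cutting the pair in half leaves a lone surrogate, which has no
// valid UTF-8 encoding of its own.
const broken = emoji.slice(0, 1);
const isLoneSurrogate = /^[\uD800-\uDFFF]$/.test(broken);
console.log(isLoneSurrogate);                  // true
```

Anything along the way that treats those two code units as independent characters can produce exactly the kind of broken sequence that makes a browser give up on the connection.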

Fortunately, that same V8 issue discussion has a work-around. You can do additional encoding/decoding which will escape the troublesome byte sequences. Since my case was dealing with JSON instead of just plain strings, my implementation also tosses the JSON serialization/deserialization into the mix.
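
A minimal sketch of this kind of work-around, built on the widely used `unescape(encodeURIComponent(...))` idiom, might look like the following. Treat it as an illustration rather than the exact code from the prototype: the function names `encodeJsonUtf8` and `decodeJsonUtf8` and the sample data are placeholders.

```javascript
// Hypothetical helper names; the technique is the escape/unescape
// trick for converting between JavaScript strings and UTF-8 byte strings.

// Serialize to JSON, then re-encode so that every character in the
// result is in the range 0x00-0xFF (one character per UTF-8 byte).
function encodeJsonUtf8(value) {
  return unescape(encodeURIComponent(JSON.stringify(value)));
}

// Reverse the steps: rebuild the original string from its UTF-8
// bytes, then parse the JSON.
function decodeJsonUtf8(encoded) {
  return JSON.parse(decodeURIComponent(escape(encoded)));
}

// Made-up sample data with a character outside the BMP (U+1F600).
const tweet = { user: "example", text: "hello \u{1F600}" };

const wire = encodeJsonUtf8(tweet);    // what the server would send
const received = decodeJsonUtf8(wire); // what the browser would recover

console.log(received.text === tweet.text); // true
```

The encoded string contains no surrogates at all, so nothing between the server and the browser has a chance to split a pair. Note that `escape` and `unescape` are legacy functions. They are still available in node and in browsers, but a long-term fix would be better handled by the platform itself.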

While searching for solutions to the problem, I saw a lot of other people frustrated by similar symptoms. I’m hoping that by posting this, others will be able to find it and apply this solution, or something similar. Also, it appears that there are still discussions about modifying the behavior of V8 to better handle this encoding issue, hopefully in a completely transparent fashion.

Oh, and if you’re interested, you can try out the Twitual 2.0 Prototype. It’s very much a work-in-progress since I’ve mostly been trying to solve these underlying issues. So right now, there’s practically no UI, and it still needs better error handling for when something goes wrong with the Twitter API. Once I settle on a templating solution, the look-and-feel of the whole thing is going to start changing radically. And again, this is a prototype, and at any given time it might not even be up and running. You’ve been warned. :-)

 

About Dougal Campbell

Dougal is a web developer, and a "Developer Emeritus" for the WordPress platform. When he's not coding PHP, Perl, CSS, JavaScript, or whatnot, he spends time with his wife, three children, a dog, and a cat in their Atlanta area home.
This entry was posted in Development.

6 Responses to Dealing with UTF in node.js

  1. Techload says:

    First of all I must say that Twitual is terrific.
    Second, your line of investigation, up to the point where you find the culprit, is fascinating.
    And last, but not least:
    Hey, Chrome developers: please, address these V8 encoding issues!

    • Dougal says:

      It’s not just a Chrome thing. It’s a condition of the V8 JavaScript engine, which is also in the node.js server. And even though Firefox uses a different engine, it exhibited similar behavior (dropping the connection when it received invalid UTF sequences).

      I just saw a note the other day that indicates that node.js may have a patch coming to deal with this internally. Of course, I already have a functioning work-around, so I’m not too worried about it at the moment. But at least in the future I might not have to manually deal with it in my code.

  2. Art says:

    Dougal,

    You are effin’ rock star man!

    Thanks for all your time and effort which went into investigating this! The workaround worked wonders!

    Cheers,
    Art

  3. Pingback: What’s new with Twitual? – Twitual Blog

  4. Pingback: Don’t use strlen() - WordPress Blog Man

  5. Ed says:

    V8 supports UTF-16 directly. Not sure if anyone in the node community is using this, but it would help you.

    UTF-16 is about 97% compatible with UCS-2, according to Wikipedia.

    Here is the V8 support:
    https://github.com/v8/v8/blob/master/include/v8.h#L1186
