
Okay, so in my post about Code Spelunking I mentioned about how working on a project can lead you to explore the code because you need to become more familiar with how the code works. But it can also lead you to explore the code to figure out why code doesn’t work. In this particular case, I spent many hours puzzling over why something didn’t work correctly, chasing down the root cause, and eventually finding a bug in the WordPress core. I documented the bug in Ticket #12394, provided a patch, and it was committed to core in Changeset [13561], which will be part of WordPress 3.0.
And how did I find this little buglet? As usual, it’s because I was doing something a little off the beaten track. I was working on some code which imports XML data into WordPress, on a scheduled basis (hourly, daily, weekly, etc). During testing, sometimes the images in the imported content would come through fine, and other times, they would be missing the src attribute, without which, there really isn’t an image, is there? So you’d view the post and there would be this big 300-pixel square hole with just the alt text where the image should have been.
At first, I didn’t know why it worked only some of the time. Then I saw the pattern that when I ran the code “manually” via a “Run now” button in my options screen, the images worked. But when the code ran via WP-Cron, they didn’t. At first, I thought it was a timing issue, and that maybe when the cron action hooks fired, maybe there was some piece of WordPress functionality that wasn’t loaded yet. But shunting my execution hook to run at a later point didn’t fix anything.
Next, I decided that one key difference when running manually versus running from cron was me — I was logged in as an admin. And, in fact, after some debugging, I determined that there was no user context at all when running from cron. When I modified the code to run as myself, the image tags came through cleanly. Well, I didn’t want to hard-code the program to always run as me, so I added a user selector to the options so that the owner of the posts could be set.
But then when I started testing again, with users of various roles, the problem cropped up again. In particular, it worked great for a user with the Editor role, but not for the Author role. Digging a little deeper into the differences between the two roles, the thing that jumped out at me is that Editors (and Admins) have the “unfiltered_html” capability.
You see, normally, when you write a post, it is sent through a series of filters which take your free-form writing, and turn it into cleaner HTML. One of these filters is called ‘kses‘ (which stands for ‘kses strips evil scripts’). This filter is especially important on multi-author blogs where you might not be able to give 100% trust to the other authors. Otherwise, one of them would be able to (for instance) put javascript in a post which would steal the cookie information from another user who reads the post. So it is the job of kses to ensure that only “safe” HTML is kept. This would also keep you from embedding things like YouTube videos, Java applets, and other fun useful things. So users with the unfiltered_html capability set in their profiles are able to post without this filtering.
This certainly seemed like a likely culprit, except for one thing: even when post content is filtered through kses, the HTML img tag is not filtered out. And neither is the src attribute on an image. That is specifically supposed to be allowed. An image is a perfectly normal thing to have in a post. So why, oh why, was my src attribute being stripped?
I started looking very closely at the kses library. It’s a rather hairy bit of code, full of complex regular expressions and state-machine logic. But when reverse-engineering how the attribute-cleaning bits work, I noticed something in one of the regular expressions: it was hardcoded to expect a space between the end of an attribute and the closing of a tag. In other words, it expected an image tag to look something like this:
<img width='400' height='300' src='people.jpg' />
But, since my data was coming from an XML source, there was no extraneous space. My image tags looked like this:
<img width='400' height='300' src='people.jpg'/>
Notice the subtle difference? There is no space between the final single-quote around 'people.jpg' and the /> which closes the tag. And because of the way the match was being done, kses was throwing away any attribute that abutted the tag-close in that fashion.
The next question was: was this (technically) a bug, or was kses just being strict about some rules of formatting? A quick search turned up the Empty Elements section of the XHTML spec, which covers the syntax for empty tags like img, br, and hr. The examples given there do not include a space before the end of the elements. Furthermore, this section points to the HTML Compatibility Guidelines, which show that adding a space is for compatibility with older HTML browsers. So, since the XHTML spec does not require the space, and WordPress is supposed to render XHTML code, the behavior in kses was definitely a bug, and not just bad manners. I quickly worked up a patch, submitted it on Trac, and brought it to the attention of the core team.
Fortunately, the WordPress system of filters allows you to alter just about anything on the fly, so I was able to “trick” the system into thinking that the posting user selected in my plugin had the unfiltered_html capability, even when they really didn’t. This allowed me to work around the bug while my plugin is running.
This bug was pretty minor in the grand scheme of things. Probably not many people had ever run into it. But after hours of puzzling over those broken image tags, it felt darned good to find it, and — more importantly — squash it. And after the release of WordPress 3.0, nobody will have to scratch their heads over it again. Yay me!
Pingback: Being Part of It | A Fool’s Wisdom