Dougal Campbell's geek ramblings

WordPress, web development, and world domination.

Bug Chasing

Okay, so in my post about Code Spelunking I mentioned about how working on a project can lead you to explore the code because you need to become more familiar with how the code works. But it can also lead you to explore the code to figure out why code doesn’t work. In this particular case, I spent many hours puzzling over why something didn’t work correctly, chasing down the root cause, and eventually finding a bug in the WordPress core. I documented the bug in Ticket #12394, provided a patch, and it was committed to core in Changeset [13561], which will be part of WordPress 3.0.

And how did I find this little buglet? As usual, it’s because I was doing something a little off the beaten track. I was working on some code which imports XML data into WordPress, on a scheduled basis (hourly, daily, weekly, etc). During testing, sometimes the images in the imported content would come through fine, and other times, they would be missing the src attribute, without which, there really isn’t an image, is there? So you’d view the post and there would be this big 300-pixel square hole with just the alt text where the image should have been.

At first, I didn’t know why it worked only some of the time. Then I saw the pattern that when I ran the code “manually” via a “Run now” button in my options screen, the images worked. But when the code ran via WP-Cron, they didn’t. At first, I thought it was a timing issue, and that maybe when the cron action hooks fired, maybe there was some piece of WordPress functionality that wasn’t loaded yet. But shunting my execution hook to run at a later point didn’t fix anything.

Next, I decided that one key difference when running manually versus running from cron was me — I was logged in as an admin. And, in fact, after some debugging, I determined that there was no user context at all when running from cron. When I modified the code to run as myself, the image tags came through cleanly. Well, I didn’t want to hard-code the program to always run as me, so I added a user selector to the options so that the owner of the posts could be set.

But then when I started testing again, with users of various roles, the problem cropped up again. In particular, it worked great for a user with the Editor role, but not for the Author role. Digging a little deeper into the differences between the two roles, the thing that jumped out at me is that Editors (and Admins) have the “unfiltered_html” capability.

You see, normally, when you write a post, it is sent through a series of filters which take your free-form writing, and turn it into cleaner HTML. One of these filters is called ‘kses‘ (which stands for ‘kses strips evil scripts’). This filter is especially important on multi-author blogs where you might not be able to give 100% trust to the other authors. Otherwise, one of them would be able to (for instance) put javascript in a post which would steal the cookie information from another user who reads the post. So it is the job of kses to ensure that only “safe” HTML is kept. This would also keep you from embedding things like YouTube videos, Java applets, and other fun useful things. So users with the unfiltered_html capability set in their profiles are able to post without this filtering.

This certainly seemed like a likely culprit, except for one thing: even when post content is filtered through kses, the HTML img tag is not filtered out. And neither is the src attribute on an image. That is specifically supposed to be allowed. An image is a perfectly normal thing to have in a post. So why, oh why, was my src attribute being stripped?

I started looking very closely at the kses library. It’s a rather hairy bit of code, full of complex regular expressions and state-machine logic. But when reverse-engineering how the attribute-cleaning bits work, I noticed something in one of the regular expressions: it was hardcoded to expect a space between the end of an attribute and the closing of a tag. In other words, it expected an image tag to look something like this:

<img width='400' height='300' src='people.jpg' />

But, since my data was coming from an XML source, there was no extraneous space. My image tags looked like this:

<img width='400' height='300' src='people.jpg'/>

Notice the subtle difference? There is no space between the final single-quote around 'people.jpg' and the /> which closes the tag. And because of the way the match was being done, kses was throwing away any attribute that abutted the tag-close in that fashion.

The next question was: was this (technically) a bug, or was kses just being strict about some rules of formatting? A quick search turned up the Empty Elements section of the XHTML spec, which covers the syntax for empty tags like img, br, and hr. The examples given there do not include a space before the end of the elements. Furthermore, this section points to the HTML Compatibility Guidelines, which show that adding a space is for compatibility with older HTML browsers. So, since the XHTML spec does not require the space, and WordPress is supposed to render XHTML code, the behavior in kses was definitely a bug, and not just bad manners. I quickly worked up a patch, submitted it on Trac, and brought it to the attention of the core team.

Fortunately, the WordPress system of filters allows you to alter just about anything on the fly, so I was able to “trick” the system into thinking that the posting user selected in my plugin had the unfiltered_html capability, even when they really didn’t. This allowed me to work around the bug while my plugin is running.

This bug was pretty minor in the grand scheme of things. Probably not many people had ever run into it. But after hours of puzzling over those broken image tags, it felt darned good to find it, and — more importantly — squash it. And after the release of WordPress 3.0, nobody will have to scratch their heads over it again. Yay me!

About Dougal Campbell

Dougal is a web developer, and a "Developer Emeritus" for the WordPress platform. When he's not coding PHP, Perl, CSS, JavaScript, or whatnot, he spends time with his wife, three children, a dog, and a cat in their Atlanta area home.
This entry was posted in WordPress and tagged , , , , , , , , , , , , , , , , . Bookmark the permalink.

16 Responses to Bug Chasing

  1. Mac_Boy says:

    Thanks for the story, well explained.

    Thank you for catching anf fixing a tuff nut!

    :-)

  2. Just wanted to pop in and offer a sincere Thank You! for contributing to the WP core. I am a heavy WP user and rely on the skills of people like you for my web development survival.

  3. Andrea_R says:

    You know, I have actually come across this bug once before. Cool that it finally got tracked down.

  4. John Sostak says:

    I’ve become pretty hands on with WP, because it is user friendly. Although I’m a marketer, I obviously am touching the code. It is awesome how in open source, pro’s like you collaborate to work out these bugs, so people like “us” can be more efficient.

    That being said, I am constantly scratching my head on something minor, and Dan or Alessio fix it in 2 minutes. Thanks, John

  5. Izba says:

    Thank you Dougal for sorting that bug out. I bet for you it must had been like a puzzle you had to bring to an end by fixing it. Sure, without the “src” in an image inserting code there wouldn’t be any image shown :) !

    Because of the 300px x 300px emptiness; something is telling me I may have encountered this in the past, but it well could have been only a manual coding I was doing.

    All the best.

  6. Linda 16C says:

    Thank you Dougal for sharing so significative story .I also think you must like a puzzle you had to bring to an end by sloving it.

    “after hours of puzzling over those broken problems, it felt darned good to find it, and — more importantly — squash it. ” ——-I agree vrey much!

  7. Outstanding work, and a great write-up for people that maybe haven’t gotten involved yet but had that gut feeling they were staring at a little ugly bug that needed squashing.

    • Dougal says:

      One of the main things that I’m hoping people will take away from this is that even though a bug may be “small”, sometimes it still takes a lot of time and effort to understand and fix it.

      If I hadn’t noticed some very small details, I might still be trying to track it down. First, I had to notice that the src attribute was always the last one in my XML source. Then I had to notice the lack of a space before the tag-close. If not for that, I might still have looked at kses, but not made the connection about the space (or lack thereof).

  8. Great writeup. The whole kses system is kind of a bitch. Whenever I come across it it’s in bad circumstances, especially in WPMU where kses has traditionally been extra annoying and mysterious.

    For anyone having issues with unfiltered_html its important to remember that without a plugin there is no real way to give your authors full trust. If you are using the ‘author’ or ‘contributor’ roles its very important to consider modifying the default permissions to allow unfiltered_html for those roles.

    In my experience it is very likely that you do trust your users not to insert malicious code, and that you are limiting their role (i.e. not making them ‘editor’) for organizational or editorial reasons. In this case you should install a role-management plugin and give the ‘unfiltered_html’ capability to anyone who you expect to be writing posts, otherwise they will eventually discover something that gets stripped from their posts, and often users will just give up instead of asking about it because it is experienced as a strange bug that destroys the content rather than a security measure (“I just thought that the video/widget/whatever wouldn’t work with WordPress”).

    Capability Manager seems to be the most stable role/capability plugin out these days. Once its installed you can edit the roles to have the extra capability.

    IMHO the kses filtering system should be modified to generate visible errors for users whenever some content gets stripped. After the user submits the page to save they should see a message at the top:

    The following HTML has been stripped from your post because you do not have permission to use tags included within it. If you believe this is an error please ask your site administrator for help.

  9. reviste says:

    the title sounds like a movie title… the article looks also like a screenplay. we shall see you at the academy awards ceremony next year?

  10. sara says:

    Thanks for sharing about how you squashed the hell out of that bug! I’m sure you just inspired many people to be pro-active when they find bugs too. You rock!

  11. felix says:

    do you know why worpress don’t show the language-preferences, which i uploaded manual on my server, in the admin-area?

    in config.php i edit this, too.

    I can’t remember, hope for help :-)

    felix

  12. Pingback: Being Part of It | A Fool’s Wisdom

  13. I use WP as a CMS for most of my web design clients and have found it the most simplest to install and easiest for the client to use. All thanks to all the work developers such as yourself are doing like this this, we kinda take for granted. So a big thanks and keep up the amazing work.

Leave a Reply

%d bloggers like this: