Dougal Campbell's geek ramblings

WordPress, web development, and world domination.

Cleaning up HTML

Quite some time ago, I coded up a PHP function that attempts to filter out potential nastiness in comments posted to MyPHPBlog sites (like this one). It’s mainly to keep people from injecting JavaScript into comments, which could cause Cross-Site Scripting (XSS) problems. But I’ve never been completely satisfied with my solution. What I really wanted was a function that would not only filter for security, but would also turn the input into valid XHTML, automagically.

Simon Willison has come up with something that is a step in that direction: safeHtmlChecker. It’s a PHP class which will parse a chunk of XHTML and return a list of errors. But I want a class that will auto-correct the errors. Maybe it’s time to call upon the power of LazyWeb? Yes, I think it is….

LazyWeb, I invoke thee!
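
As a rough illustration of what "auto-correct" could mean for the simplest case, unbalanced tags, here is a minimal sketch in plain PHP. It is not safeHtmlChecker and not the existing comment filter; the function name close_open_tags() is made up for this example. It assumes simple tag syntax (no ">" inside attribute values), treats only a handful of elements as self-closing, and does no security filtering at all.

```php
<?php
// Hypothetical helper (not from safeHtmlChecker or any library): closes
// elements that were left open and drops closing tags that match nothing.
function close_open_tags(string $html): string
{
    // Elements that are written self-closed in XHTML, e.g. <br />.
    $void = ['br', 'hr', 'img', 'input'];
    $stack = [];  // names of the elements currently open, innermost last
    $out = '';
    $pos = 0;     // offset just past the last tag that was processed

    // Find every opening, closing, or self-closed tag along with its offset.
    preg_match_all(
        '~<(/?)([a-zA-Z][a-zA-Z0-9]*)([^<>]*)>~',
        $html,
        $tags,
        PREG_SET_ORDER | PREG_OFFSET_CAPTURE
    );

    foreach ($tags as $tag) {
        [$full, $start] = $tag[0];
        $isClosing = $tag[1][0] === '/';
        $name = strtolower($tag[2][0]);  // XHTML element names are lowercase
        $attrs = rtrim($tag[3][0]);

        // Copy the text between the previous tag and this one unchanged.
        $out .= substr($html, $pos, $start - $pos);
        $pos = $start + strlen($full);

        if ($isClosing) {
            // Close down to the matching opener; ignore stray closing tags.
            if (in_array($name, $stack, true)) {
                do {
                    $open = array_pop($stack);
                    $out .= "</$open>";
                } while ($open !== $name);
            }
        } elseif (in_array($name, $void, true) || substr($attrs, -1) === '/') {
            // Void or already self-closed: emit in the <name ... /> form.
            $out .= '<' . $name . rtrim(rtrim($attrs, '/')) . ' />';
        } else {
            $stack[] = $name;
            $out .= '<' . $name . $attrs . '>';
        }
    }

    // Copy any trailing text, then close whatever is still open.
    $out .= substr($html, $pos);
    while ($stack) {
        $out .= '</' . array_pop($stack) . '>';
    }
    return $out;
}

echo close_open_tags('<b>bold <i>both'), "\n";  // <b>bold <i>both</i></b>
echo close_open_tags('line<br>next'), "\n";     // line<br />next
```

A stack is enough here because the only goal is balanced nesting. Checking which elements and attributes are actually allowed, as a real validator does, is a separate job.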

About Dougal Campbell

Dougal is a web developer, and a "Developer Emeritus" for the WordPress platform. When he's not coding PHP, Perl, CSS, JavaScript, or whatnot, he spends time with his wife, three children, a dog, and a cat in their Atlanta area home.

7 Responses to Cleaning up HTML

  1. Danny says:

    The ‘standard’ tool for cleaning HTML (which can output XHTML) is HTML Tidy from the W3C (by Dave Raggett originally, I think) – I’m pretty sure you’ll find a PHP interface with a bit of googling. I use this both as a desktop app and (Java version) embedded in code, and its results really do seem magical at times.

  2. Dougal says:

    I thought I had heard something about a native way to call Tidy from PHP, and I did do a search at php.net, but that came up empty. I just didn’t have time to cast my net wider yet.

  3. Matt says:

    This looks like a working demo. I suspect they’re just passing it through the command line, which is what I’d recommend as the easiest solution as well.

  4. Dougal says:

    Yes, I had run across phpTidyHt earlier today. I’m definitely going to take a look at it, though I still hope to find a solution that doesn’t require a system call to an external program.

  5. Sean "XariusX" Maddison says:

    Try this:

    $foo = strip_tags($foo);

    then

    $foo = nl2br($foo);

    This will remove ALL HTML tags and format all newlines to <br />.

    If that's what you are after anyway.

  6. Dougal says:

    I already use the strip_tags() function (with the optional list of allowed tags). But it doesn’t do anything to help close unbalanced tags, or to strip out attempts to insert javascript: URLs (unless you disallow links altogether).

    I’m still mulling the problem over. Another solution might be to use something like BBCode, or a Wiki-like syntax.

  7. dusoft says:

    Check out the content management system Absolut Engine.
    It produces valid XHTML Strict from a WYSIWYG editor!
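
The strip_tags() exchange in comments 5 and 6 can be sketched as follows. The sample comment text is made up for illustration, and the allow-list is passed as a string of tags, the form strip_tags() has always accepted. This only demonstrates the behaviour described in the thread; it is not a recommended filter.

```php
<?php
// Made-up sample input: an unclosed <b> and a link with a javascript: URL.
$comment = "Nice <b>post!\n<a href=\"javascript:alert(1)\">click me</a>";

// Comment 5: remove every tag, then put a <br /> in front of each newline.
$plain = nl2br(strip_tags($comment));

// Comment 6: keep an allow-list of tags instead of removing everything.
$filtered = strip_tags($comment, '<a><b>');

// The allowed tags pass through untouched, attributes included, so the
// javascript: URL survives and the unclosed <b> stays unclosed.
var_dump($plain === "Nice post!<br />\nclick me");     // bool(true)
var_dump(strpos($filtered, 'javascript:') !== false);  // bool(true)
var_dump(strpos($filtered, '</b>') === false);         // bool(true)
```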
