Quite some time ago, I coded up a PHP function that attempts to filter out potential nastiness in comments posted to MyPHPBlog sites (like this one). It’s mainly there to keep people from injecting JavaScript into comments, which could open the door to Cross-Site Scripting attacks. But I’ve never been completely satisfied with my solution. What I really wanted was a function that would not only filter for security, but would also turn the input into valid XHTML, automagically.
Simon Willison has come up with something that is a step in that direction: safeHtmlChecker. It’s a PHP class which will parse a chunk of XHTML and return a list of errors. But I want a class that will auto-correct the errors. Maybe it’s time to call upon the power of LazyWeb? Yes, I think it is….
LazyWeb, I invoke thee!
7 Comments
The ‘standard’ tool for cleaning HTML (which can output XHTML) is HTML Tidy from the W3C (by Dave Raggett originally, I think) – I’m pretty sure you’ll find a PHP interface with a bit of googling. I use this both as a desktop app and (Java version) embedded in code, and its results really do seem magical at times.
I thought I had heard something about a native way to call Tidy from PHP, and I did do a search at php.net, but that came up empty. I just didn’t have time to cast my net wider yet.
This looks like a working demo. I suspect they’re just passing it through the command line, which is what I’d recommend as the easiest solution as well.
Yes, I had run across phpTidyHt earlier today. I’m definitely going to take a look at it, though I still hope to find a solution that doesn’t require a system call to an external program.
Try this:
$foo = strip_tags($foo);
then
$foo = nl2br($foo);
This will remove ALL HTML tags and insert a <br /> before every newline.
If that’s what you are after anyway.
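To make the suggestion above concrete, here is a small worked example. The sample comment text and the variable names are made up for illustration. Note that strip_tags() removes the tags themselves but keeps the text that sat between them, including the body of a script element.

```php
<?php
// Made-up comment text containing markup, a script element and a newline.
$foo = "<b>Hello</b> <script>alert(1)</script>\nworld";

// strip_tags() drops every tag but leaves the text between the tags.
$stripped = strip_tags($foo);
// $stripped is now "Hello alert(1)\nworld"

// nl2br() inserts "<br />" in front of each newline; the newline itself stays.
$formatted = nl2br($stripped);
// $formatted is now "Hello alert(1)<br />\nworld"
```

The leftover "alert(1)" is harmless as plain text, but it shows that this approach sanitizes by flattening everything rather than by understanding the markup.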
I already use the strip_tags() function (with the optional list of allowed tags). But it doesn’t do anything to help close unbalanced tags, or to strip out attempts to insert javascript: URLs (unless you disallow links altogether).
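One possible way to cover both of those gaps, sketched here as an assumption and not as anything MyPHPBlog or safeHtmlChecker actually does, is to run the stripped text through PHP’s DOM extension: its forgiving HTML parser closes unbalanced tags, and the resulting tree can be walked to drop risky attributes. The function name clean_comment_html() and the default allow list are hypothetical, and the attribute checks are deliberately minimal, not a complete XSS filter.

```php
<?php
// Hypothetical helper: a sketch only, not a complete security filter.
function clean_comment_html(string $input, string $allowed = '<a><b><i><em><strong><p><br>'): string
{
    // Step 1: drop every tag that is not on the allow list.
    $input = strip_tags($input, $allowed);

    // Step 2: let the HTML parser rebuild a balanced tree from the tag soup.
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    // The leading XML declaration is a common trick to make loadHTML() read UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $input . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The wrapper div added above is the first div in document order.
    $wrap = $doc->getElementsByTagName('div')->item(0);

    // Step 3: remove event-handler attributes and javascript: URLs.
    foreach (iterator_to_array($wrap->getElementsByTagName('*'), false) as $el) {
        foreach (iterator_to_array($el->attributes, false) as $attr) {
            $name  = strtolower($attr->nodeName);
            // Ignore whitespace and control characters used to disguise the scheme.
            $value = strtolower(preg_replace('/[\s\x00-\x1f]+/', '', $attr->nodeValue));
            $isHandler   = strncmp($name, 'on', 2) === 0;
            $isScriptUrl = in_array($name, ['href', 'src'], true)
                && strncmp($value, 'javascript:', 11) === 0;
            if ($isHandler || $isScriptUrl) {
                $el->removeAttribute($attr->nodeName);
            }
        }
    }

    // Step 4: serialize the wrapper's children as XML, so every tag is closed.
    $out = '';
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveXML($child);
    }
    return $out;
}
```

Calling clean_comment_html() on something like '<b>bold <i>both</b> <a href="javascript:alert(1)">link</a>' should give back balanced markup with the href removed. One caveat of serializing with saveXML(): empty elements come out self-closed (for example an empty link becomes <a/>), which is well-formed XML but may not render as intended when served as text/html.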
I’m still mulling the problem over. Another solution might be to use something like BBCode, or a Wiki-like syntax.
Check out the content management system Absolut Engine.
It produces valid XHTML Strict from a WYSIWYG editor!