Quite some time ago, I coded up a PHP function that attempts to filter out potential nastiness in comments posted to MyPHPBlog sites (like this one). It’s mainly there to keep people from injecting JavaScript into comments, which could open the door to cross-site scripting attacks. But I’ve never been completely satisfied with my solution. What I really wanted was a function that would not only filter for security, but would also turn the input into valid XHTML, automagically.
Simon Willison has come up with something that is a step in that direction: safeHtmlChecker. It’s a PHP class which will parse a chunk of XHTML and return a list of errors. But I want a class that will auto-correct the errors. Maybe it’s time to call upon the power of LazyWeb? Yes, I think it is….
LazyWeb, I invoke thee!

The ‘standard’ tool for cleaning HTML (which can output XHTML) is HTML Tidy from the W3C (by Dave Raggett originally, I think). I’m pretty sure you’ll find a PHP interface with a bit of googling. I use it both as a desktop app and (in its Java version) embedded in code, and its results really do seem magical at times.
I thought I had heard something about a native way to call Tidy from PHP, and I did do a search at php.net, but that came up empty. I just haven’t had time to cast my net wider yet.
This looks like a working demo. I suspect they’re just passing it through the command line, which is what I’d recommend as the easiest solution as well.
Yes, I had run across phpTidyHt earlier today. I’m definitely going to take a look at it, though I still hope to find a solution that doesn’t require a system call to an external program.
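For what it is worth, here is a minimal sketch of calling Tidy without a system call, assuming a PHP build that has the tidy extension loaded (nothing in this thread confirms that it is available on any given host). The helper name repair_to_xhtml() is made up for illustration, and 'output-xhtml' and 'show-body-only' are HTML Tidy configuration options. Tidy repairs markup; it is not a security filter, so it would not replace the existing comment filtering.

```php
<?php
// Hypothetical helper: repairs an HTML fragment into XHTML using the
// tidy extension. Returns null when the extension is not loaded or
// when Tidy reports a failure.
function repair_to_xhtml(string $fragment): ?string
{
    if (!function_exists('tidy_repair_string')) {
        return null; // assumption not met: no tidy extension on this build
    }

    $config = [
        'output-xhtml'   => true, // emit XHTML rather than HTML
        'show-body-only' => true, // return only the fragment, not a whole page
    ];

    $result = tidy_repair_string($fragment, $config, 'utf8');

    return $result === false ? null : $result;
}

// Unbalanced input: neither <p> nor <b> is closed.
$fixed = repair_to_xhtml('<p>Unclosed <b>bold text');
echo $fixed === null ? "tidy extension not available\n" : $fixed;
```

If the extension is missing, the helper simply returns null, so a caller could fall back to whatever filtering is already in place.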
Try this:
$foo = strip_tags($foo);
then
$foo = nl2br($foo);
This will remove ALL HTML tags and insert a <br /> at every newline.
If that’s what you are after anyway.
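A small runnable sketch of that suggestion, for anyone who wants to see the effect. The function name plain_text_comment() is made up for illustration; it only chains the two built-in calls.

```php
<?php
// Hypothetical helper: reduce a comment to plain text with line breaks.
function plain_text_comment(string $input): string
{
    // strip_tags() removes every tag but keeps the text between the tags.
    $text = strip_tags($input);

    // nl2br() inserts a <br /> before each newline; the newline itself stays.
    return nl2br($text);
}

echo plain_text_comment("Hello <b>world</b>\nSecond line");
// prints "Hello world<br />", a newline, then "Second line"
```

Note that only the tags are removed: the text inside a stripped element (including a script element) is left in place as harmless plain text.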
I already use the strip_tags() function (with the optional list of allowed tags). But it doesn’t do anything to help close unbalanced tags, or to strip out attempts to insert JavaScript URLs (unless you disallow links altogether).
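To make those two gaps concrete, here is a short sketch. The sample comment and variable names are invented for illustration; the allowed-tags argument is passed as a string of tags.

```php
<?php
// Invented sample: an unclosed <b>, a disallowed <i>, and a link whose
// href uses the javascript: scheme.
$comment = '<b>Nice <i>post</i>! <a href="javascript:alert(1)">click me</a>';

// Allow only <a> and <b>; everything else is stripped.
$filtered = strip_tags($comment, '<a><b>');

echo $filtered, "\n";

// The <i> tags are gone, but the link keeps its attributes, so the
// javascript: URL survives the filter.
var_dump(stripos($filtered, 'javascript:') !== false); // bool(true)

// Nothing closes the <b> tag either: strip_tags() only removes tags,
// it never adds the missing ones.
var_dump(strpos($filtered, '</b>') !== false); // bool(false)
```

So an allow-list passed to strip_tags() decides which tags survive, but it says nothing about their attributes or about whether they are balanced.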
I’m still mulling the problem over. Another solution might be to use something like BBCode, or a Wiki-like syntax.
Check out the content management system Absolut Engine.
It produces valid XHTML Strict from a WYSIWYG editor!