Introducing an HTML5 Parser and Serializer for PHP

I’ve been amazed for some time that PHP didn’t have an HTML5 parser. For years PHP has been able to parse HTML4 and XHTML using libxml. But HTML5 is different enough to matter, and over the past several years it has grown into the standard for markup. Older specs are now mostly left to legacy applications.

Several years ago html5lib was started to fill this gap. Unfortunately, it has gone unmaintained and isn’t up to the tasks we need. For most of its life it was hidden away in a sub-directory on Google Code, and some of its methods are thousands of lines long, which makes them difficult to update.

A couple of months ago Matt Butcher decided to bring an HTML5 parser and serializer to PHP and roped me into helping him. Support for HTML5 has been a major request for QueryPath, which sparked his interest.

We started with html5lib but quickly found it needed a bit of work. So we tackled that and ended up with a new parser, a new serializer, Composer support, automated tests using PHPUnit and Travis CI, and more.

After a few months of work, we are ready with the first alpha release.

Installation

The easiest way to get started is with Composer. In your project’s composer.json file, add a requirement for masterminds/html5. For example:

"require" : {
	"masterminds/html5" : "1.0.0-alpha1"
}

If you are not using Composer, tagged versions can be downloaded from GitHub and will work with any PSR-0 autoloader.
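If you don’t already have a PSR-0 autoloader on hand, the following is a minimal sketch of one. It is only an illustration of the PSR-0 mapping rules, not something shipped with the library: the $baseDir path and the psr0RelativePath() helper are hypothetical names, so point the path at wherever you unpacked the download.

```php
<?php
// Hypothetical location of the unpacked library source; adjust to your layout.
$baseDir = __DIR__ . '/lib/html5/src';

// Hypothetical helper: map a class name to a relative file path using PSR-0 rules.
function psr0RelativePath($class)
{
    $class = ltrim($class, '\\');
    $path = '';
    $pos = strrpos($class, '\\');
    if ($pos !== false) {
        // Namespace separators become directory separators.
        $namespace = substr($class, 0, $pos);
        $path = str_replace('\\', DIRECTORY_SEPARATOR, $namespace) . DIRECTORY_SEPARATOR;
        $class = substr($class, $pos + 1);
    }
    // Underscores in the class name itself also become directory separators.
    return $path . str_replace('_', DIRECTORY_SEPARATOR, $class) . '.php';
}

// Register the autoloader; classes that are not found are left to other loaders.
spl_autoload_register(function ($class) use ($baseDir) {
    $file = $baseDir . DIRECTORY_SEPARATOR . psr0RelativePath($class);
    if (is_file($file)) {
        require $file;
    }
});
```

In a real project an existing PSR-0 loader from your framework is likely a better choice than rolling your own.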

Parsing HTML5

The architecture exposes the low-level components so they can be used and reused directly, but the easiest way to get started is with the built-in functions. For example:

$html = '<!DOCTYPE html><html><body>test</body></html>';
$dom = \HTML5::loadHTML($html);

In this case $dom is a DOMDocument, just like you’d get from the native parser in libxml, so existing tools built around DOMDocument will work with it.
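To make the "existing tools" point concrete, here is a small sketch that uses only the standard DOM API. So that it runs without the library installed, it builds the document with PHP’s native DOMDocument::loadHTML() as a stand-in; the assumption is that a document returned by \HTML5::loadHTML() can be handled the same way.

```php
<?php
// Stand-in for the result of \HTML5::loadHTML($html); any DOMDocument will do here.
$dom = new DOMDocument();
$dom->loadHTML('<!DOCTYPE html><html><body><p>test</p></body></html>');

// Plain DOM API calls, nothing specific to the HTML5 library.
$body = $dom->getElementsByTagName('body')->item(0);
$text = trim($body->textContent);

echo $text, PHP_EOL;
```

The same goes for anything else that accepts a DOMDocument, though XPath queries may need extra care if the parser places elements in a namespace.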

If you want to load markup directly from a file:

$dom = \HTML5::load('/path/to/file.html');

Serializing (Writing) HTML5

Just as we can parse HTML5, we can write it as well. For example:

// Get html from a DOM.
$html = \HTML5::saveHTML($dom);

// Write a DOM to a file as html5.
\HTML5::save($dom, '/path/to/file.html');

Learn More or Get Involved

If you want to learn more about what it can do, get involved in development, or just follow along, the project is up on GitHub.

Update: if you want to follow along as QueryPath is updated to work with this HTML5 parser, there is an issue for it.