Symfony DomCrawler and HTML5

Working with HTML5 is not quite the same as working with XHTML or HTML 4.01. That’s why we wrote an HTML5 parser and writer for PHP. PHP’s built-in parser doesn’t produce a proper DOM for elements that HTML5 added or changed relative to earlier versions.

Symfony has a DomCrawler component that lets you navigate HTML and XML documents. DomCrawler is similar to QueryPath. When HTML text is passed in, it’s parsed using PHP’s built-in parser.

If you want to parse and navigate HTML5, you can use these two projects together.

Grab What You Need With Composer

Include the projects in your composer.json file. While symfony/css-selector isn’t required, you are limited to XPath without it. If you’re like me, you’d rather use CSS selectors.

"require": {
	"masterminds/html5": "1.*",
	"symfony/dom-crawler": "2.*",
	"symfony/css-selector": "2.*"
}

Just Use Them

Using the HTML5 parser, turn the HTML5 content into a DOM.

// Load the html content into a DOM.
$dom = \HTML5::loadHTMLFile('path/to/file.html');

Then send this DOM into the DomCrawler. There are two ways you can easily do this. First, you can pass it in when the crawler instance is created.

use Symfony\Component\DomCrawler\Crawler;

// Pass in the DOM when the Crawler is created.
$crawler = new Crawler($dom);

Alternatively, you can pass it in after the crawler is created using the add(), addDocument(), or addNode() methods. For example:

$crawler = new Crawler();
$crawler->addDocument($dom);

Then use the crawler to find what you’re looking for. The Symfony site has some excellent documentation.
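As a rough illustration of that last step, here is a minimal sketch that parses a small HTML5 snippet and queries it with a CSS selector. The sample markup and the variable names are made up for this example, and it assumes the three Composer packages listed earlier are installed and autoloaded from vendor/autoload.php.

```php
<?php
// Assumption: the packages from the "require" block were installed with Composer.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical sample markup that uses HTML5-only elements.
$html = <<<HTML
<!DOCTYPE html>
<html>
  <body>
    <video controls>
      <source src="movie.mp4" type="video/mp4">
      <track kind="subtitles" src="subs_en.vtt" srclang="en">
      <track kind="subtitles" src="subs_de.vtt" srclang="de">
    </video>
  </body>
</html>
HTML;

// Parse with the HTML5 parser, then hand the resulting DOM to the crawler.
$dom = \HTML5::loadHTML($html);
$crawler = new Crawler($dom);

// filter() takes a CSS selector, which is why symfony/css-selector is installed.
$tracks = $crawler->filter('video track');

// attr() reads an attribute from the first matched node.
$firstSrc = $tracks->attr('src');

// Iterating the crawler yields the matched DOM elements themselves.
$languages = array();
foreach ($tracks as $track) {
    $languages[] = $track->getAttribute('srclang');
}

print count($tracks) . ' tracks, first source: ' . $firstSrc . PHP_EOL;
print 'Languages: ' . implode(', ', $languages) . PHP_EOL;
```

If you skip symfony/css-selector, the same query can be written with filterXPath() instead of filter().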

The built-in DomCrawler method html() produces HTML 4.01-style markup. If you want to get HTML5 back out, use either \HTML5::saveHTML(), which returns the markup as a string, or \HTML5::save(), which writes it to a file. For example:

// Print each matched <track> element as HTML5.
$crawler->filter('track')->each(function($node, $i) {
	print \HTML5::saveHTML($node->getNode(0)) . PHP_EOL;
});