DOM and XPath query for HTML parsing

Question

I'm trying to write a robot that fetches HTML and parses it daily. For parsing HTML I could just use string functions like explode, or regular expressions, but I found the DOM XPath code much cleaner, so now I can keep a configuration of all the sites I have to spider and the tags I have to pull out, like:

'http://examplesite.com' => '//div/a[@class="articleDesc"]/@href'

So the code looks like this:

    $dom = new DOMDocument();
    @$dom->loadHTML($html);   // @ silences warnings about malformed HTML
    $xpath = new DOMXPath($dom);
    $tags = $xpath->query('//body/div[@class="articleDesc"]');

    foreach ($tags as $tag) {
        echo $tag->nodeValue . "\n";
    }

So with this I get all the div tags with class articleDesc, which is great. But I noticed that all the HTML tags inside the div are stripped out. How would I get the whole contents of the div I'm looking at?

I also find it hard to find any proper documentation for $xpath->query() that explains how to form the query string. The PHP site doesn't say much about its exact format. Still, my main problem i

Nope, doesn't work for me. The function DOMinnerHTML($element) that's in the link doesn't work for my XPath object.
– Tadej Magajna, Commented Nov 20, 2011 at 22:37
Good XPath tutorial: schlitt.info/opensource/blog/0704_xpath.html
– Matthew Turland, Commented Nov 26, 2011 at 3:54

pguardiario · Accepted Answer · 2011-11-26 04:08:59Z

2

+50

The simple answer is:

    foreach ($tags as $tag) {
        echo $dom->saveXML($tag);
    }

If you want the a tags with their HTML intact, the XPath would be:

    //a[@class="articleDesc"]

That's assuming the a tags have that class attribute.
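To make the difference between nodeValue and saveXML() concrete, here is a minimal, self-contained sketch. The sample markup is hypothetical (it is not from the question), and libxml_use_internal_errors() is used in place of the @ operator as one possible way to keep parser warnings quiet.

```php
<?php
// Hypothetical sample markup, not taken from the original question.
$html = '<html><body>'
      . '<div class="articleDesc"><p>First <b>bold</b> item</p></div>'
      . '<div class="other">ignored</div>'
      . '<div class="articleDesc"><a href="/more">Read more</a></div>'
      . '</body></html>';

// Collect parser warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$tags  = $xpath->query('//body/div[@class="articleDesc"]');

$fragments = array();
foreach ($tags as $tag) {
    // nodeValue would return only the text content; saveXML($tag)
    // serializes the node itself together with its child markup.
    $fragments[] = $dom->saveXML($tag);
}

foreach ($fragments as $fragment) {
    echo $fragment, "\n";
}
```

Each entry in $fragments includes the matched div element itself, so this gives the "outer" HTML of every match rather than only its contents.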

answered Nov 26, 2011 at 4:08

pguardiario


Sjaak Trekhaak · Accepted Answer · 2011-11-21 09:30:42Z

1

Try using http://www.php.net/manual/en/simplexmlelement.asxml.php

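Or, alternatively:

    function getNodeInnerHTML(DOMNode $oNode) {
        $oDom = new DOMDocument();
        // childNodes (plural) is the DOMNode property holding the children
        foreach ($oNode->childNodes as $oChild) {
            // Deep-copy each child into the temporary document
            $oDom->appendChild($oDom->importNode($oChild, true));
        }
        return $oDom->saveHTML();
    }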
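As a usage illustration, the sketch below applies the same inner-HTML idea to a node returned by an XPath query. It is a variation, not the function above: the helper name innerHtml is hypothetical, and it serializes each child through the owner document with saveHTML($node), which assumes PHP 5.3.6 or newer. The sample markup is also made up.

```php
<?php
// Hypothetical helper: concatenates the serialized children of a node.
function innerHtml(DOMNode $node)
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // Passing a node to saveHTML() requires PHP 5.3.6 or newer.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Hypothetical sample markup.
$source = '<div class="articleDesc"><p>Intro</p><a href="/more">Read more</a></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($source);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$div   = $xpath->query('//div[@class="articleDesc"]')->item(0);

// The div's own tag is not part of the result, only its contents.
$inner = innerHtml($div);
echo $inner, "\n";
```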

answered Nov 21, 2011 at 9:30

Sjaak Trekhaak


4 Comments

Tadej Magajna Over a year ago

Meh, that would work in a way, but the perfect way for me would be to get, from 'examplesite.com' => '//div/a[@class="articleDesc"]/@href', a list of strings with the HTML intact for the matching elements. I wonder how I'd do that.

Sjaak Trekhaak Over a year ago

I might get you wrong here, but doesn't that just require you to get the innerHTML, using one of the functions above, of the parent element matching your XPath?

Tadej Magajna Over a year ago

I think not.... inner html of the parent element matching xpath would return all the html inside it. However, I'd like to get all the div tags that have class article desc for instance...

Sjaak Trekhaak Over a year ago

So echo getNodeInnerHTML($tag) is not what you were looking for? If so, I'm having trouble understanding exactly what you want. Is it possible to show an example of your input, and the desired output?
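The configuration-driven idea discussed in these comments could be sketched roughly as follows. Everything here is an assumption for illustration: the $pages array stands in for the downloaded pages so the example runs offline, and the sample markup is invented. Attribute matches such as @href are returned as plain values, while element matches are serialized with their markup.

```php
<?php
// Hypothetical configuration: site URL => XPath expression.
$config = array(
    'http://examplesite.com' => '//div/a[@class="articleDesc"]/@href',
);

// Stand-in for the fetched pages so the sketch runs offline; a real robot
// would download each URL (for example with file_get_contents or cURL).
$pages = array(
    'http://examplesite.com' =>
        '<div><a class="articleDesc" href="/a1">One</a></div>'
      . '<div><a class="articleDesc" href="/a2">Two</a></div>',
);

libxml_use_internal_errors(true);

$results = array();
foreach ($config as $site => $expression) {
    $dom = new DOMDocument();
    $dom->loadHTML($pages[$site]);
    $xpath = new DOMXPath($dom);

    $results[$site] = array();
    foreach ($xpath->query($expression) as $node) {
        // Attribute nodes (such as @href) carry their value in nodeValue;
        // element nodes are serialized with their markup intact.
        $results[$site][] = ($node instanceof DOMAttr)
            ? $node->nodeValue
            : $dom->saveXML($node);
    }
}
libxml_clear_errors();

print_r($results);
```

Dropping the trailing /@href from the expression would make the same loop return the serialized a elements instead of their link targets.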

mseancole · Accepted Answer · 2011-11-25 17:40:47Z

0

This should load all of the inner tags as well. While it's not DOM, the two are interchangeable, and later you can use dom_import_simplexml to bring it back into DOM.

    $xml = simplexml_load_string($html);
    $tags = $xml->xpath('//body/div[@class="articleDesc"]');

answered Nov 25, 2011 at 17:40

mseancole


4 Comments

Tadej Magajna Over a year ago

Gives an error. xpath() doesn't work with $xml. If I try $xml = dom_import_simplexml($xml) before the second line, it doesn't work either.

mseancole Over a year ago

The exact error would be helpful. The first line imports the $html string into SimpleXML; if it's not a string, try simplexml_load_file instead. The second line is copied directly from yours but converted for SimpleXML. Admittedly I have not run it myself, but this is the same code I use at work, and it works for me there. dom_import_simplexml($tags) should only be used after the SimpleXML has been loaded, and assuming you have something you want to do with it in DOM. Otherwise it is not necessary; I only included it in case you wanted to switch back to DOM after loading the results.

Tadej Magajna Over a year ago

simplexml_load_string($html) returns false, and after I pass that to xpath() it breaks, of course. It also gives a lot of warnings like: Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 36: parser error : Opening and ending tag mismatch: META line 8 and HEAD in /usr/share/nginx/html/synd/robots/robot.php on line 25. I know the HTML may not be perfect, which may be why simplexml returns false, but it is a proper HTML webpage which gets rendered in the browser.

mseancole Over a year ago

From the sound of it, your HTML isn't well formed. While that's not necessary for it to show up properly in the browser, it is if you wish to use any kind of parser on it. Try closing your meta and head tags and try again. Meta tags are self-closing, so just add a forward slash to the end of them; that's easy enough to forget. Once your HTML is well formed it should work.
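When the page cannot be edited by hand, one possible alternative is to let DOMDocument::loadHTML() parse the markup, since it uses a more forgiving HTML parser, and then convert the tree with simplexml_import_dom(). The following is a sketch under that assumption. The sample page is hypothetical and only imitates the unclosed META tag from the warning above.

```php
<?php
// Hypothetical malformed page: the META tag is never closed.
$html = '<html><head><META name="x" content="y"><title>t</title></head>'
      . '<body><div class="articleDesc"><b>Hello</b> world</div></body></html>';

// Record parse errors instead of printing warnings.
libxml_use_internal_errors(true);

// The strict XML parser rejects this input and returns false.
$strict = simplexml_load_string($html);

// The HTML parser recovers from the unclosed tag and builds a tree.
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Convert the parsed DOM tree to SimpleXML and query it there.
$xml  = simplexml_import_dom($dom);
$tags = $xml->xpath('//body/div[@class="articleDesc"]');

foreach ($tags as $tag) {
    // asXML() keeps the inner tags of each matched element.
    echo $tag->asXML(), "\n";
}
```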

Lao · Accepted Answer · 2011-11-26 16:58:05Z

0

You could use Scrapy, an awesome spider framework (in Python).

answered Nov 26, 2011 at 16:58

Lao

