I'm trying to write a robot that fetches HTML and parses it daily. To parse the HTML I could just use string functions like explode, or regular expressions, but I found the DOM XPath code much cleaner, so now I can keep a configuration of all the sites I have to spider and the tags I have to pull out, like:

'http://examplesite.com' => '//div/a[@class="articleDesc"]/@href'

So the code looks like this:

$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings about malformed HTML
$xpath = new DOMXPath($dom);
$tags = $xpath->query('//body/div[@class="articleDesc"]');

foreach ($tags as $tag) {
    echo $tag->nodeValue . "\n";
}

So with this I get all the div tags with the class articleDesc, which is great. But I noticed that all the HTML tags inside the div are stripped out. How would I get the whole contents of the div I'm looking at, markup included?

I also find it hard to find any proper documentation for $xpath->query() that explains how to form the query string. The PHP site doesn't say much about its exact syntax. Still, my main problem i

4 Answers

The simple answer is:

foreach ($tags as $tag) 
    echo $dom->saveXML($tag);

If you want the a tags with their HTML intact, the XPath would be

//a[@class="articleDesc"]

That's assuming the a tags have that class attribute.
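
For reference, here is a minimal self-contained sketch of the saveXML() approach. The sample markup and the $results variable are made up for illustration; only the DOMDocument and DOMXPath calls come from the question and the answer above.

```php
<?php
// Hypothetical sample input standing in for a fetched page.
$html = '<html><body><div class="articleDesc">Some <b>bold</b> text</div></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$results = [];
foreach ($xpath->query('//div[@class="articleDesc"]') as $tag) {
    // Passing a node to saveXML() serializes that node with its child markup;
    // nodeValue would return only the concatenated text.
    $results[] = $dom->saveXML($tag);
}

foreach ($results as $markup) {
    echo $markup . "\n";
}
```

Each entry in $results should then hold the matched element together with its nested tags rather than just its text.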

Try using SimpleXMLElement::asXML(): http://www.php.net/manual/en/simplexmlelement.asxml.php

Or, alternative:

function getNodeInnerHTML(DOMNode $oNode) {
  $oDom = new DOMDocument();
  // The property is childNodes (plural); copy each child into a fresh document
  foreach ($oNode->childNodes as $oChild) {
    $oDom->appendChild($oDom->importNode($oChild, true));
  }
  return $oDom->saveHTML();
}
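
A variation on the same idea, as a rough sketch: instead of copying the children into a second document, serialize each child through the node's own document. The function name innerHtml and the sample markup are invented here, and passing a node to saveHTML() assumes PHP 5.3.6 or later.

```php
<?php
// Hypothetical helper: returns the markup inside $node, without the node's own tag.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() with a node argument serializes just that node.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Made-up sample input for illustration.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<div class="articleDesc">Some <b>bold</b> text</div>');
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$div = $xpath->query('//div[@class="articleDesc"]')->item(0);

$inner = innerHtml($div);
echo $inner . "\n";
```

Called once per node returned by $xpath->query(), this gives one inner-HTML string per match.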

4 Comments

meh.. that would work in a way, but the perfect way for me would be to get from 'examplesite.com' => '//div/a[@class="articleDesc"]/@href' a list of html unstripped strings for the elements matching... I wonder how I'd do that
I might get you wrong here, but doesn't that just require you to get the innerHTML, using one of the functions above, of the parent element matching your XPath?
I think not.... inner html of the parent element matching xpath would return all the html inside it. However, I'd like to get all the div tags that have class article desc for instance...
So echo getNodeInnerHTML($tag) is not what you were looking for? If so, I'm having trouble understanding exactly what you want. Is it possible to show an example of your input, and the desired output?

This should load all of the inner tags as well. While it's not DOM, the two are interchangeable, and later you can use dom_import_simplexml to bring it back into DOM.

$xml=simplexml_load_string($html);
$tags=$xml->xpath('//body/div[@class="articleDesc"]');
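
One caveat: simplexml_load_string() expects well-formed XML, so it can return false on ordinary HTML. A possible workaround, sketched below under that assumption, is to let DOMDocument::loadHTML() parse the page and then convert the result with simplexml_import_dom(). The sample markup (with a deliberately unclosed meta tag) and the $markup variable are made up for illustration.

```php
<?php
// Hypothetical sample page; the unclosed <meta> is not valid XML.
$html = '<html><head><meta charset="utf-8"><title>t</title></head>'
      . '<body><div class="articleDesc">Some <b>bold</b> text</div></body></html>';

// The HTML parser is tolerant; keep its warnings out of the output.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Convert the parsed DOM tree into a SimpleXMLElement.
$xml = simplexml_import_dom($dom);
$tags = $xml->xpath('//body/div[@class="articleDesc"]');

$markup = [];
foreach ($tags as $tag) {
    // asXML() returns the element including its nested tags.
    $markup[] = $tag->asXML();
}

foreach ($markup as $m) {
    echo $m . "\n";
}
```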

4 Comments

It gives an error. xpath() doesn't work with $xml. If I try $xml = dom_import_simplexml($xml) before the second line, it doesn't work either.
The exact error would be helpful. The first line imports the $html string into SimpleXML; if it's not a string, try simplexml_load_file instead. The second line is copied directly from yours but converted for SimpleXML. Admittedly I have not run it myself, but this is the same code I use at work, and it works for me there. dom_import_simplexml($tags) should only be used after the SimpleXML has been loaded, and only if you have something you want to do with it in DOM. Otherwise it is not necessary; I just included it in case you wanted to switch back to DOM after loading the results.
simplexml_load_string($html) returns false, and after I pass that into xpath() it breaks, of course. It also gives a lot of warnings like: "Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 36: parser error : Opening and ending tag mismatch: META line 8 and HEAD in /usr/share/nginx/html/synd/robots/robot.php on line 25". I know the HTML may not be perfect, which may be why simplexml returns false, but it is a proper HTML webpage which gets rendered in the browser.
From the sounds of it, your HTML isn't well formed. That isn't necessary for it to show up properly in the browser, but it is necessary if you want to run a strict XML parser such as SimpleXML on it. Try closing your meta and head tags and try again. Meta tags are self-closing, so just add a forward slash to the end of them; that's easy enough to forget. Once your HTML is well formed, it should work.

You could use Scrapy, an awesome spider framework (in Python).
