How can I extract data from an HTML table in PHP? The data is in this format:

Table 1

<tr><td class="body" valign="top"><a href="example"><b>DATA</b></a></td><td class="body" valign="top">Data_Text</td></tr>

Table 2

<tr><th><div id="Data">Data</div></th><td>Data_Text_1</td><td>Data_Text_2</td></tr>

Table 3

<tr><td width="120"><a href="example" target="_blank">DATA</a></td><td>Data_Text</td></tr>

I want to get Data and Data_Text (or Data_Text_1 and Data_Text_2) from the three tables.
I've tried:

$html = file_get_contents($link);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodes  = $xpath->query('//td[]');
$nodes2 = $xpath->query('//td[]');

But it doesn't show any data.

I'll offer a bounty on this question the day after tomorrow.

  • There seems to be some mistake: You cannot obtain "Data_Text" from Table 2 -- it doesn't have a text node with such string value. Please, edit and correct. Commented Apr 29, 2012 at 4:21

3 Answers

Using the Simple HTML DOM library (simple_html_dom.php):

<?php

include 'simple_html_dom.php';

$html = file_get_html('thetable.html');

$rows = $html->find('tr');
foreach($rows as $row) {
    echo $row->plaintext;
}

?>

or use 'td'...

<?php

include 'simple_html_dom.php';

$html = file_get_html('thetable.html');

$cells = $html->find('td');
foreach($cells as $cell) {
    echo $cell->plaintext;
}

?>

Given an HTML document called xpathTables.html like this:

<html>
  <body>
    <table>
      <tbody>
        <tr><td class="body" valign="top"><a href="example"><b>DATA</b></a></td><td class="body" valign="top">Data_Text</td></tr>
      </tbody> 
    </table>

    <table>
      <tbody>
        <tr><th><div id="Data">Data</div></th><td>Data_Text_1</td><td>Data_Text_2</td></tr>
      </tbody>
    </table>

    <table>
      <tbody>
        <tr><td width="120"><a href="example" target="_blank">DATA</a></td><td>Data_Text</td></tr>
      </tbody>
    </table>
  </body>
</html>

And this PHP script:

<?php

$link = "xpathTables.html";

$html = file_get_contents($link);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$tables = $doc->getElementsByTagName('table');

$nodes  = $xpath->query('.//tbody/tr/td/a/b', $tables->item(0));
var_dump($nodes->item(0)->nodeValue);
$nodes  = $xpath->query('.//tbody/tr/td[@class="body"]', $tables->item(0));
var_dump($nodes->item(1)->nodeValue);

$nodes  = $xpath->query('.//tbody/tr/th/div[@id="Data"]', $tables->item(1));
var_dump($nodes->item(0)->nodeValue);
$nodes  = $xpath->query('.//tbody/tr/td', $tables->item(1));
var_dump($nodes->item(0)->nodeValue);
$nodes  = $xpath->query('.//tbody/tr/td', $tables->item(1));
var_dump($nodes->item(1)->nodeValue);

$nodes  = $xpath->query('.//tbody/tr/td/a', $tables->item(2));
var_dump($nodes->item(0)->nodeValue);
$nodes  = $xpath->query('.//tbody/tr/td', $tables->item(2));
var_dump($nodes->item(1)->nodeValue);

You get this output:

string(4) "DATA"
string(9) "Data_Text"
string(4) "Data"
string(11) "Data_Text_1"
string(11) "Data_Text_2"
string(4) "DATA"
string(9) "Data_Text"

I didn't fully understand your question, so I wrote this example to show every text node your tables contain. If you are only interested in some of those nodes, pick the XPath queries that select them.

I included the table and tbody tags just to make the example more HTML-like.
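If the goal is one label plus its value(s) per row, a more generic variant of the queries above may be easier to maintain. What follows is only a minimal sketch, assuming that the first cell of each row (th or td) holds the label and the remaining cells hold the values. The helper name extractRows is hypothetical and not part of any library, and the inline sample markup is a shortened copy of the three tables.

```php
<?php

/**
 * Hypothetical helper: returns one array per <tr>, holding the trimmed
 * text of each direct <th>/<td> child in document order.
 */
function extractRows(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('//tr') as $tr) {
        $cells = [];
        // Relative query: only the th/td children of the current row.
        foreach ($xpath->query('./th | ./td', $tr) as $cell) {
            $cells[] = trim($cell->textContent);
        }
        $rows[] = $cells;
    }
    return $rows;
}

$html = <<<'HTML'
<html><body>
<table><tr><td class="body" valign="top"><a href="example"><b>DATA</b></a></td><td class="body" valign="top">Data_Text</td></tr></table>
<table><tr><th><div id="Data">Data</div></th><td>Data_Text_1</td><td>Data_Text_2</td></tr></table>
<table><tr><td width="120"><a href="example" target="_blank">DATA</a></td><td>Data_Text</td></tr></table>
</body></html>
HTML;

$rows = extractRows($html);
foreach ($rows as $cells) {
    // Assumption: the first cell is the label, the rest are its values.
    $label = array_shift($cells);
    echo $label, ': ', implode(', ', $cells), PHP_EOL;
}
```

Because textContent concatenates all descendant text, the nested a, b and div elements do not need separate queries. The trade-off is that any extra markup inside a cell ends up in the same string.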


Use this single XPath expression:

/*/table/tr//text()[normalize-space()]

This selects every text node that does not consist solely of whitespace characters and that is a descendant of any tr element that is a child of a table element that is a child of the top element of the document.

XSLT-based verification:

 <xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select=
   "/*/table/tr//text()[normalize-space()]"/>

. . . . . . .
  <xsl:for-each select=
    "/*/table/tr//text()[normalize-space()]">
    "<xsl:copy-of select="."/>"
  </xsl:for-each>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the following XML document:

<html>
 <table>
    <tr>
        <td class="body" valign="top">
            <a href="example">
                <b>DATA</b>
            </a>
        </td>
        <td class="body" valign="top">Data_Text</td>
    </tr>
 </table>

 <table>
    <tr>
        <th>
            <div id="Data">Data</div>
        </th>
        <td>Data_Text_1</td>
        <td>Data_Text_2</td>
    </tr>
 </table>

 <table>
    <tr>
        <td width="120">
            <a href="example" target="_blank">DATA</a>
        </td>
        <td>Data_Text</td>
    </tr>
 </table>
</html>

the XPath expression is evaluated and the selected text nodes are output twice: first as the direct result of the evaluation, where they appear concatenated, and then one node per line, each surrounded by quotes:

DATAData_TextDataData_Text_1Data_Text_2DATAData_Text

. . . . . . .

"DATA"

"Data_Text"

"Data"

"Data_Text_1"

"Data_Text_2"

"DATA"

"Data_Text"
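The same expression can also be evaluated directly from PHP with DOMXPath. This is a minimal sketch, assuming the markup is well-formed and loaded with loadXML(), so that table really is a child of the top element. If the markup is parsed with loadHTML() instead, the parser typically wraps the content in a body element, and a looser form such as //table//tr//text()[normalize-space()] would be needed. The shortened sample document is an assumption made for illustration.

```php
<?php

$xml = <<<'XML'
<html>
 <table>
    <tr>
        <td class="body" valign="top">
            <a href="example">
                <b>DATA</b>
            </a>
        </td>
        <td class="body" valign="top">Data_Text</td>
    </tr>
 </table>
 <table>
    <tr>
        <th>
            <div id="Data">Data</div>
        </th>
        <td>Data_Text_1</td>
        <td>Data_Text_2</td>
    </tr>
 </table>
</html>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Collect the value of every text node that is not whitespace-only.
$texts = [];
foreach ($xpath->query('/*/table/tr//text()[normalize-space()]') as $node) {
    $texts[] = $node->nodeValue;
}

// Print each selected node on its own line, surrounded by quotes.
foreach ($texts as $text) {
    echo '"', $text, '"', PHP_EOL;
}
```

The whitespace-only text nodes created by the indentation are filtered out by the normalize-space() predicate, so only the five data strings remain in $texts.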
