28

I know we can use PHP DOM to parse HTML in PHP, but I have a specific requirement. I have HTML content like the following:

<p class="Heading1-P">
    <span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 3</span>
</p>

I want to parse the above HTML and save the content into two different arrays like:

$heading and $content

$heading = array('Chapter 1', 'Chapter 2', 'Chapter 3');
$content = array('This is chapter 1', 'This is chapter 2', 'This is chapter 3');

I can achieve this easily using jQuery, but I am not sure if that's the right way.

  • use jquery as its structure is simple. Commented Aug 21, 2013 at 4:58
  • @Susheel: HTML content will be much bigger as it is the output after parsing docx files Commented Aug 21, 2013 at 5:00
  • You could use regular expressions if you don't like to go for PHP DOM. Commented Aug 21, 2013 at 5:00
  • @LorenzMeyer do not use regular expressions to parse html Commented Aug 21, 2013 at 5:06
  • @blessed for bigger dom use php simple dom parser Commented Aug 21, 2013 at 5:08

7 Answers

31

I have used DOMDocument and DOMXPath to get the solution:

$test = <<< HTML
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 3</span>
</p>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($test);
$xpath = new DOMXPath($dom);
$heading = parseToArray($xpath,'Heading1-H');
$content = parseToArray($xpath,'Normal-H');

var_dump($heading);
echo "<br/>";
var_dump($content);
echo "<br/>";

function parseToArray(DOMXPath $xpath, string $class): array
{
    $xpathquery = "//*[@class='$class']";
    $elements = $xpath->query($xpathquery);

    $resultarray = [];
    foreach ($elements as $element) {
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
            $resultarray[] = $node->nodeValue;
        }
    }

    return $resultarray;
}

1 Comment

I've found this link to be very useful to learn the XPATH.query syntax: w3schools.com/xml/xpath_syntax.asp
24

Take a look at PHP Simple HTML DOM Parser.

Its syntax is similar to jQuery's, so you can easily select any element you want by ID or class.

// include/require the simple html dom parser file

$html_string = '
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 1</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 1</span>
    </p>
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 2</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 2</span>
    </p>
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 3</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 3</span>
    </p>';
$html = str_get_html($html_string);

// Initialise both arrays so they exist even when nothing matches
$heading = [];
$content = [];
foreach ($html->find('span') as $element) {
    if ($element->class === 'Heading1-H') {
        $heading[] = $element->innertext;
    } elseif ($element->class === 'Normal-H') {
        $content[] = $element->innertext;
    }
}

7 Comments

!!NOTICE!! not using "->innertext" leads to memory leaks.
This is a much easier option and produces more readable code compared to using DomDocument.
Is there an option to install that with composer?
Composer install is now possible: composer require simplehtmldom/simlehtmldom dev-master and use simplehtmldom\HtmlWeb;
@luckydonald there is a typo in your comment. missing the "p" in the second "simple" in the composer require command
12

Here's an alternative way to parse the html using DiDOM.

composer require imangazaliev/didom
<?php

use DiDom\Document;

require_once('vendor/autoload.php');

$html = <<<HTML
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 3</span>
</p>
HTML;

$document = new Document($html);

// find chapter headings
$elements = $document->find('.Heading1-H');

$headings = [];

foreach ($elements as $element) {
    $headings[] = $element->text();
}

// find chapter texts
$elements = $document->find('.Normal-H');

$chapters = [];

foreach ($elements as $element) {
    $chapters[] = $element->text();
}

echo("Headings\n");

foreach ($headings as $heading) {
    echo("- {$heading}\n");
}

echo("Chapter texts\n");

foreach ($chapters as $chapter) {
    echo("- {$chapter}\n");
}

4 Comments

@miken32 why the edit?
Because this isn't an advertisement, micro-optimizations are typically a waste of time, and everyone already knows Simple HTML DOM is trash.
Or at least they should by now lol
Respectfully disagree.
6

One option is to use DOMDocument and DOMXPath. They have a bit of a learning curve, but once you are past it, you will be pretty happy with what you can achieve.

Read the following on php.net:

http://php.net/manual/en/class.domdocument.php

http://php.net/manual/en/class.domxpath.php
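
As a minimal sketch of how the two classes fit together (the sample markup and variable names here are assumptions made for illustration, not taken from the question): loadHTML() may emit warnings for markup that libxml does not recognise, such as HTML5 elements, and libxml_use_internal_errors() keeps those warnings out of the output while the document is still parsed.

```php
<?php
// Hypothetical sample input; <section> is an HTML5 element that older
// libxml versions report as an invalid tag.
$html = '<section>'
      . '<p class="Heading1-P"><span class="Heading1-H">Chapter 1</span></p>'
      . '<p class="Normal-P"><span class="Normal-H">This is chapter 1</span></p>'
      . '</section>';

// Collect parse warnings internally instead of emitting them as PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Discard the collected warnings and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);

// Select every <span> whose class attribute is exactly "Heading1-H".
$headings = [];
foreach ($xpath->query("//span[@class='Heading1-H']") as $span) {
    $headings[] = trim($span->textContent);
}

print_r($headings);
```

If the warnings matter, libxml_get_errors() can be called before libxml_clear_errors() to inspect them.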

Hope this helps.

1 Comment

This has problems with broken HTML.
0

Here is the functional-style equivalent of @saji89's answer. Search for any element at any level that has the desired class (use contains() if an element may have multiple classes assigned), then target the node text with text(). After converting the DOMNodeList returned by the query to an array, simply isolate the nodeValue column.

Code: (Demo)

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach (['Heading1-H', 'Normal-H'] as $class) {
    var_export(
        array_column(
            iterator_to_array($xpath->query("//*[@class='$class']/text()")),
            'nodeValue'
        )
    );
    echo "\n---\n";
}

Output:

array (
  0 => 'Chapter 1',
  1 => 'Chapter 2',
  2 => 'Chapter 3',
)
---
array (
  0 => 'This is chapter 1',
  1 => 'This is chapter 2',
  2 => 'This is chapter 3',
)
---
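
For elements that carry several classes, the contains() variant mentioned above might look like the sketch below. The sample markup and the classQuery() helper are hypothetical, added only for illustration. Padding the normalized class attribute with spaces is a common XPath 1.0 idiom for matching a whole class token, so that "Heading1-Hx" does not count as "Heading1-H".

```php
<?php
// Hypothetical input: one heading has an extra class, one span has a
// similar-looking class that must not match.
$html = '<p><span class="Heading1-H first">Chapter 1</span></p>'
      . '<p><span class="Heading1-Hx">Not a heading</span></p>'
      . '<p><span class="Normal-H">This is chapter 1</span></p>'
      . '<p><span class="Heading1-H">Chapter 2</span></p>';

// Hypothetical helper: builds an XPath query that matches $class as a
// whole, space-separated token of the class attribute.
function classQuery(string $class): string
{
    return "//*[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]/text()";
}

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$headings = array_column(
    iterator_to_array($xpath->query(classQuery('Heading1-H'))),
    'nodeValue'
);

var_export($headings);
```

A plain contains(@class, 'Heading1-H') would also pick up the "Heading1-Hx" span, which is why the sketch compares against the space-padded value.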


0

The DOMDocument answers all use XPath, but XPath syntax can be intimidating for new users and for simple processing like this it isn't necessary.

$html_string = <<< HTML
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 1</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 1</span>
    </p>
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 2</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 2</span>
    </p>
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 3</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 3</span>
    </p>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html_string);

// Initialise both arrays so print_r() below never sees an undefined variable
$heading = [];
$content = [];
foreach ($dom->getElementsByTagName('span') as $element) {
    $class = $element->getAttribute('class');
    if ($class === 'Heading1-H') {
        $heading[] = $element->textContent;
    } elseif ($class === 'Normal-H') {
        $content[] = $element->textContent;
    }
}
print_r($heading);
print_r($content);

Note that when looking for a particular class, a better check would be something like preg_match('/\bNormal-H\b/', $class) to account for the possibility of multiple items in the class list.
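
One caveat with a word-boundary regex is that \b treats a hyphen as a boundary, so a class such as Normal-H-small would also match. A stricter check, sketched here with a hypothetical hasClass() helper that is not part of DOMDocument, is to split the attribute on whitespace and compare whole tokens:

```php
<?php
// Hypothetical helper: returns true when $wanted appears as a complete,
// whitespace-separated token in the class attribute value.
function hasClass(string $classAttribute, string $wanted): bool
{
    $tokens = preg_split('/\s+/', trim($classAttribute), -1, PREG_SPLIT_NO_EMPTY);

    return in_array($wanted, $tokens, true);
}

var_dump(hasClass('Normal-H', 'Normal-H'));            // bool(true)
var_dump(hasClass('intro Normal-H wide', 'Normal-H')); // bool(true)
var_dump(hasClass('Normal-H-small', 'Normal-H'));      // bool(false)
```

In the loop above, the comparison $class === 'Normal-H' could then be swapped for hasClass($class, 'Normal-H').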


-13

// Requires PHP Simple HTML DOM Parser; file_get_html() is not a built-in PHP function
// Create DOM from URL or file

$html = file_get_html('http://www.google.com/');

// Find all images

foreach($html->find('img') as $element) 
   echo $element->src . '<br>';

// Find all links

foreach($html->find('a') as $element) 
   echo $element->href . '<br>';

2 Comments

file_get_html ?? Is that a PHP function ?
file_get_contents is the built-in function. He copied and pasted this from the PHP Simple HTML DOM website.
