Get the text from all elements with a nominated class as a flat array

Question

I know we can parse HTML in PHP with the DOM extension, but I have a specific requirement. I have HTML content like the following:

<p class="Heading1-P">
    <span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 3</span>
</p>

I want to parse the above HTML and save the content into two different arrays like:

$heading and $content

$heading = array('Chapter 1', 'Chapter 2', 'Chapter 3');
$content = array('This is chapter 1', 'This is chapter 2', 'This is chapter 3');

I can achieve this easily using jQuery, but I am not sure that's the right way.

@Susheel: The HTML content will be much bigger, as it is the output of parsing docx files.
– laradev, Commented Aug 21, 2013 at 5:00
You could use regular expressions if you don't want to use PHP DOM.
– Lorenz Meyer, Commented Aug 21, 2013 at 5:00

miken32 · Accepted Answer · 2024-07-10 17:29:35Z

31

I have used DOMDocument and DOMXPath to get the solution:

$test = <<< HTML
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 3</span>
</p>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($test);
$xpath = new DOMXPath($dom);
$heading = parseToArray($xpath,'Heading1-H');
$content = parseToArray($xpath,'Normal-H');

var_dump($heading);
echo "<br/>";
var_dump($content);
echo "<br/>";

function parseToArray(DOMXPath $xpath, string $class): array
{
    $xpathquery = "//*[@class='$class']";
    $elements = $xpath->query($xpathquery);

    $resultarray = [];
    foreach ($elements as $element) {
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
          $resultarray[] = $node->nodeValue;
        }
    }

    return $resultarray;
}

edited Jul 10, 2024 at 17:29 by miken32

answered Aug 21, 2013 at 5:45 by saji89

1 Comment

Nigini Over a year ago

I've found this link to be very useful to learn the XPATH.query syntax: w3schools.com/xml/xpath_syntax.asp

iniravpatel · Accepted Answer · 2018-11-21 14:33:40Z

24

Take a look at PHP Simple HTML DOM Parser.

Its selector syntax is similar to jQuery's, so you can easily select any element you want by ID or class.

// include/require the simple html dom parser file

$html_string = '
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 1</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 1</span>
    </p>
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 2</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 2</span>
    </p>
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 3</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 3</span>
    </p>';
$html = str_get_html($html_string);

// Initialise the result arrays before filling them
$heading = [];
$content = [];

foreach ($html->find('span') as $element) {
    if ($element->class === 'Heading1-H') {
        $heading[] = $element->innertext;
    } elseif ($element->class === 'Normal-H') {
        $content[] = $element->innertext;
    }
}

edited Nov 21, 2018 at 14:33 by iniravpatel

answered Aug 21, 2013 at 4:58 by Paul Denisevich

7 Comments

Mahdi Rafatjah Over a year ago

!!NOTICE!! not using "->innertext" leads to memory leaks.

Stephen G Over a year ago

This is a much easier option and produces more readable code compared to using DomDocument.

luckydonald Over a year ago

Is there an option to install that with composer?

luckydonald Over a year ago

Composer install is now possible: composer require simplehtmldom/simlehtmldom dev-master and use simplehtmldom\HtmlWeb;

Philip Over a year ago

@luckydonald there is a typo in your comment. missing the "p" in the second "simple" in the composer require command


miken32 · Accepted Answer · 2024-07-10 17:33:44Z

12

Here's an alternative way to parse the HTML, using DiDOM.

composer require imangazaliev/didom

<?php

use DiDom\Document;

require_once('vendor/autoload.php');

$html = <<<HTML
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 3</span>
</p>
HTML;

$document = new Document($html);

// find chapter headings
$elements = $document->find('.Heading1-H');

$headings = [];

foreach ($elements as $element) {
    $headings[] = $element->text();
}

// find chapter texts
$elements = $document->find('.Normal-H');

$chapters = [];

foreach ($elements as $element) {
    $chapters[] = $element->text();
}

echo("Headings\n");

foreach ($headings as $heading) {
    echo("- {$heading}\n");
}

echo("Chapter texts\n");

foreach ($chapters as $chapter) {
    echo("- {$chapter}\n");
}

edited Jul 10, 2024 at 17:33 by miken32

answered Dec 25, 2020 at 6:11 by 8ctopus

4 Comments

8ctopus Over a year ago

@miken32 why the edit?

miken32 Over a year ago

Because this isn't an advertisement, micro-optimizations are typically a waste of time, and everyone already knows Simple HTML DOM is trash.

miken32 Over a year ago

Or at least they should by now lol

8ctopus Over a year ago

Respectfully disagree.

Greeso · Accepted Answer · 2018-07-16 19:03:14Z

6

One option is to use DOMDocument and DOMXPath. They have a bit of a learning curve, but once you are past it, you will be pretty happy with what you can achieve.

Read the following on php.net:

http://php.net/manual/en/class.domdocument.php

http://php.net/manual/en/class.domxpath.php

Hope this helps.
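As a minimal sketch of how the two classes fit together (one possible approach, not the only one), the function below loads an HTML string and returns the text of every element whose class attribute equals a given value. The helper name textsByClass is hypothetical. The call to libxml_use_internal_errors() assumes the input may contain markup that libxml would otherwise warn about, which seems plausible for HTML generated from docx files.

```php
<?php
// Hypothetical helper: collect the text of every element with an exact class attribute.
function textsByClass(string $html, string $class): array
{
    // Collect libxml parse warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $texts = [];
    // Assumes $class contains no single quote, since it is interpolated into the query.
    foreach ($xpath->query("//*[@class='$class']") as $element) {
        $texts[] = trim($element->textContent);
    }

    return $texts;
}

$html = '<p class="Heading1-P"><span class="Heading1-H">Chapter 1</span></p>'
      . '<p class="Normal-P"><span class="Normal-H">This is chapter 1</span></p>';

print_r(textsByClass($html, 'Heading1-H'));
print_r(textsByClass($html, 'Normal-H'));
```

Calling the function once per class gives the two flat arrays the question asks for.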

edited Jul 16, 2018 at 19:03

answered Aug 21, 2013 at 5:00 by Greeso

1 Comment

Mahdi Rafatjah Over a year ago

This has problems with broken HTML.

mickmackusa · Accepted Answer · 2024-07-10 23:51:21Z

Here is the functional-style equivalent of @saji89's answer. Search for any element on any level which has the desired class (use contains() if there may be multiple classes assigned to an element), then target the node text with text(). After converting the XPath object to an array, simply isolate the nodeValue column.

Code:

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach (['Heading1-H', 'Normal-H'] as $class) {
    var_export(
        array_column(
            iterator_to_array($xpath->query("//*[@class='$class']/text()")),
            'nodeValue'
        )
    );
    echo "\n---\n";
}

Output:

array (
  0 => 'Chapter 1',
  1 => 'Chapter 2',
  2 => 'Chapter 3',
)
---
array (
  0 => 'This is chapter 1',
  1 => 'This is chapter 2',
  2 => 'This is chapter 3',
)
---
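The contains() variant mentioned above could look like the following sketch. It uses the common XPath 1.0 idiom of padding the normalised class list with spaces so that only whole class names match. The helper name classTokenQuery and the sample markup are made up for illustration, and the helper assumes the class name contains no single quote.

```php
<?php
// Hypothetical helper: build an XPath query that matches one class among several.
function classTokenQuery(string $class): string
{
    // Pad the normalised class list with spaces so only whole tokens match.
    return "//*[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]/text()";
}

// Illustrative markup: the first span has two classes, the second has a similar-looking class.
$html = '<p><span class="Heading1-H first">Chapter 1</span>'
      . '<span class="Heading1-Hx">Not a heading</span></p>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$headings = [];
foreach ($xpath->query(classTokenQuery('Heading1-H')) as $textNode) {
    $headings[] = $textNode->nodeValue;
}

var_export($headings);
```

A plain contains(@class, 'Heading1-H') would also match the Heading1-Hx element, which is why the space padding is used here.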

miken32 · Accepted Answer · 2024-07-11 21:55:31Z

The DOMDocument answers all use XPath, but XPath syntax can be intimidating for new users and for simple processing like this it isn't necessary.

$html_string = <<< HTML
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 1</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 1</span>
    </p>
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 2</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 2</span>
    </p>
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 3</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 3</span>
    </p>
HTML;

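$dom = new DOMDocument();
$dom->loadHTML($html_string);

// Initialise both arrays so print_r() works even when nothing matches
$heading = [];
$content = [];

foreach ($dom->getElementsByTagName('span') as $element) {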
    $class = $element->getAttribute('class');
    if ($class === 'Heading1-H') {
        $heading[] = $element->textContent;
    } elseif($class === 'Normal-H') {
        $content[] = $element->textContent;
    }
}
print_r($heading);
print_r($content);

Note that when looking for a particular class, a better check would be something like preg_match('/\bNormal-H\b/', $class), to account for the possibility of multiple items in the class list.
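As an alternative sketch of that multiple-class check, the class attribute can be split on whitespace and compared token by token. A word-boundary regex such as /\bNormal-H\b/ also matches a class like Normal-H-old, because the hyphen counts as a word boundary, so the token comparison is a little stricter. The helper name hasClass and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: true when $class is one of the whitespace-separated class tokens.
function hasClass(DOMElement $element, string $class): bool
{
    $tokens = preg_split('/\s+/', trim($element->getAttribute('class')), -1, PREG_SPLIT_NO_EMPTY);

    return in_array($class, $tokens, true);
}

// Illustrative markup: one span with two classes, one with a similar-looking class.
$dom = new DOMDocument();
$dom->loadHTML(
    '<p><span class="Normal-H intro">This is chapter 1</span>'
    . '<span class="Normal-H-old">Skipped</span></p>'
);

$content = [];
foreach ($dom->getElementsByTagName('span') as $element) {
    if (hasClass($element, 'Normal-H')) {
        $content[] = $element->textContent;
    }
}

print_r($content);
```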

Chen-Tsu Lin · Accepted Answer · 2014-03-05 08:13:13Z

-13

// Create DOM from URL or file (file_get_html() comes from PHP Simple HTML DOM Parser, not PHP core)

$html = file_get_html('http://www.google.com/');

// Find all images

foreach($html->find('img') as $element) 
   echo $element->src . '<br>';

// Find all links

foreach($html->find('a') as $element) 
   echo $element->href . '<br>';

edited Mar 5, 2014 at 8:13 by Chen-Tsu Lin

answered Mar 5, 2014 at 7:55 by jfraber

2 Comments

everydayapps Over a year ago

file_get_html ?? Is that a PHP function ?

Mohammad Alipour Over a year ago

file_get_content is right. he has copy past from php simple dom website
