What built-in PHP functions are useful for web scraping? What are some good resources (web or print) for getting up to speed on web scraping with PHP?
-
I'd like to recommend this class I recently came across: Simple HTML DOM Parser. – Krzysztof Prugar, Apr 21, 2009
-
PHP is a particularly bad language for this. It lacks an event-driven framework, which is almost necessary for this task. Can you crawl one site with it? Yes. Will you ever crawl a lot of sites well? No. – Evan Carroll, Aug 30, 2010
-
@EvanCarroll Will cURL and DOMDocument be suitable for scraping the price and image of products from multiple websites (to output on my website)? If not, what would you suggest? – stadisco, Jun 19, 2015
-
Just try it; if it works, it's good enough for you. Node is a much better choice for building a web scraper. Also PhantomJS, if you need something modern that actually has a DOM and runs the JavaScript on it. – Evan Carroll, Jun 22, 2015
9 Answers
Scraping generally encompasses 3 steps:
- first you GET or POST your request to a specified URL
- next you receive the HTML that is returned as the response
- finally you parse out of that HTML the text you'd like to scrape
To accomplish steps 1 and 2, below is a simple PHP class which uses cURL to fetch web pages with either GET or POST. After you get the HTML back, you use regular expressions to accomplish step 3 by parsing out the text you'd like to scrape.
For regular expressions, my favorite tutorial site is the following: Regular Expressions Tutorial
My favorite program for working with regexes is RegexBuddy. I would advise you to try the demo of that product even if you have no intention of buying it. It is an invaluable tool and will even generate code for the regexes you make in your language of choice (including PHP).
Usage:
$curl = new Curl();
$html = $curl->get("http://www.google.com");
// now, do your regex work against $html
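For step 3, here is a minimal sketch of the regex pass. The HTML is an inline sample so the snippet is self-contained; in practice you'd run the same `preg_match_all()` against the `$html` returned by `get()`:

```php
<?php
// Sample markup standing in for a fetched page.
$html = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>';

// Capture the href and the link text of every anchor tag.
preg_match_all('/<a href="([^"]+)">([^<]+)<\/a>/', $html, $matches);

print_r($matches[1]); // the hrefs: /a and /b
print_r($matches[2]); // the link texts: First and Second
```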
PHP Class:
<?php
class Curl
{
    public $cookieJar = "";
    public $curl; // cURL handle, (re)created per request

    public function __construct($cookieJarFile = 'cookies.txt')
    {
        $this->cookieJar = $cookieJarFile;
    }

    // Common options for every request: browser-like headers,
    // cookie persistence, and automatic redirect following.
    function setup()
    {
        $header = array();
        $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
        $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
        $header[] = "Cache-Control: max-age=0";
        $header[] = "Connection: keep-alive";
        $header[] = "Keep-Alive: 300";
        $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
        $header[] = "Accept-Language: en-us,en;q=0.5";
        $header[] = "Pragma: "; // browsers keep this blank

        curl_setopt($this->curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7');
        curl_setopt($this->curl, CURLOPT_HTTPHEADER, $header);
        curl_setopt($this->curl, CURLOPT_COOKIEJAR, $this->cookieJar);
        curl_setopt($this->curl, CURLOPT_COOKIEFILE, $this->cookieJar);
        curl_setopt($this->curl, CURLOPT_AUTOREFERER, true);
        curl_setopt($this->curl, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($this->curl, CURLOPT_RETURNTRANSFER, true);
    }

    function get($url)
    {
        $this->curl = curl_init($url);
        $this->setup();
        return $this->request();
    }

    // Convenience wrapper: returns the first capture group of every match.
    function getAll($reg, $str)
    {
        preg_match_all($reg, $str, $matches);
        return $matches[1];
    }

    function postForm($url, $fields, $referer = '')
    {
        $this->curl = curl_init($url);
        $this->setup();
        curl_setopt($this->curl, CURLOPT_URL, $url);
        curl_setopt($this->curl, CURLOPT_POST, 1);
        curl_setopt($this->curl, CURLOPT_REFERER, $referer);
        curl_setopt($this->curl, CURLOPT_POSTFIELDS, $fields);
        return $this->request();
    }

    function getInfo($info)
    {
        return ($info == 'lasturl')
            ? curl_getinfo($this->curl, CURLINFO_EFFECTIVE_URL)
            : curl_getinfo($this->curl, $info);
    }

    function request()
    {
        return curl_exec($this->curl);
    }
}
?>
4 Comments
- Note that get() and postForm() each call $this->curl = curl_init($url), which opens a new cURL session per request; the CURLOPT_COOKIEJAR / CURLOPT_COOKIEFILE options set in setup() are what persist cookies across those sessions.
- There was a typo here: $cookieJar should read $this->cookieJar.
I recommend Goutte, a simple PHP Web Scraper.
Example usage:
Create a Goutte Client instance (which extends Symfony\Component\BrowserKit\Client):
use Goutte\Client;
$client = new Client();
Make requests with the request() method:
$crawler = $client->request('GET', 'http://www.symfony-project.org/');
The request method returns a Crawler object (Symfony\Component\DomCrawler\Crawler).
Click on links:
$link = $crawler->selectLink('Plugins')->link();
$crawler = $client->click($link);
Submit forms:
$form = $crawler->selectButton('sign in')->form();
$crawler = $client->submit($form, array('signin[username]' => 'fabien', 'signin[password]' => 'xxxxxx'));
Extract data:
$nodes = $crawler->filter('.error_list');
if ($nodes->count()) {
    die(sprintf("Authentification error: %s\n", $nodes->text()));
}
printf("Nb tasks: %d\n", $crawler->filter('#nb_tasks')->text());
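If you'd rather stick to built-in PHP, the DOM extension's DOMDocument and DOMXPath classes cover the same extraction step without a third-party crawler. A rough equivalent of the two filter() calls above, run against inline sample markup so it's self-contained:

```php
<?php
// Sample markup standing in for a downloaded page.
$html = '<div><p class="error_list">Bad login</p><p id="nb_tasks">7</p></div>';

$doc = new DOMDocument();
// The @ suppresses warnings loadHTML emits for imperfect real-world HTML.
@$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

// Equivalent of $crawler->filter('.error_list')
$errors = $xpath->query('//p[@class="error_list"]');
echo $errors->item(0)->textContent, "\n"; // Bad login

// Equivalent of $crawler->filter('#nb_tasks')
echo $xpath->query('//*[@id="nb_tasks"]')->item(0)->textContent, "\n"; // 7
```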
Comments
ScraperWiki is a pretty interesting project. It helps you build scrapers online in Python, Ruby or PHP; I was able to get a simple attempt up in a few minutes.
1 Comment
If you need something that is easy to maintain, rather than fast to execute, it could help to use a scriptable browser, such as SimpleTest's.
Comments
I'd either use libcurl or Perl's LWP (libwww for Perl). Is there a libwww for PHP?
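There is no direct LWP port, but PHP's built-in HTTP stream wrapper covers the simple cases: file_get_contents() accepts a stream context carrying the method, headers, and a timeout, so plain GET works without curl. A minimal sketch; the commented-out line shows the actual fetch, with the URL being whatever you're scraping:

```php
<?php
// Describe the request in a stream context; no network I/O happens here.
$context = stream_context_create([
    'http' => [
        'method'  => 'GET',
        'header'  => "Accept-Language: en\r\nUser-Agent: MyScraper/1.0\r\n",
        'timeout' => 10,
    ],
]);

// The fetch itself would be:
// $html = file_get_contents('http://www.example.com/', false, $context);

// Inspect the options we just set.
$opts = stream_context_get_options($context);
echo $opts['http']['method'], "\n"; // GET
```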
2 Comments
Scraper class from my framework:
<?php
/*
Example:
$site = $this->load->cls('scraper', 'http://www.anysite.com');
$excss = $site->getExternalCSS();
$incss = $site->getInternalCSS();
$ids = $site->getIds();
$classes = $site->getClasses();
$spans = $site->getSpans();
print '<pre>';
print_r($excss);
print_r($incss);
print_r($ids);
print_r($classes);
print_r($spans);
*/
class scraper
{
    // Despite the name, this property holds the downloaded page HTML.
    private $url = '';

    public function __construct($url)
    {
        $this->url = file_get_contents($url);
    }

    // Each getter returns array(matches, match count).
    public function getInternalCSS()
    {
        preg_match_all('/(style=")(.*?)(")/is', $this->url, $patterns);
        return array($patterns[2], count($patterns[2]));
    }

    public function getExternalCSS()
    {
        preg_match_all('/(href=")(\w.*\.css)"/i', $this->url, $patterns);
        return array($patterns[2], count($patterns[2]));
    }

    public function getIds()
    {
        preg_match_all('/(id="(\w*)")/is', $this->url, $patterns);
        return array($patterns[2], count($patterns[2]));
    }

    public function getClasses()
    {
        preg_match_all('/(class="(\w*)")/is', $this->url, $patterns);
        return array($patterns[2], count($patterns[2]));
    }

    public function getSpans()
    {
        preg_match_all('/(<span>)(.*)(<\/span>)/', $this->url, $patterns);
        return array($patterns[2], count($patterns[2]));
    }
}
?>
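Note that the patterns above only match double-quoted attributes, so this kind of regex scraping is fragile. You can check what a pattern such as the one in getIds() actually captures by running it against a sample string directly:

```php
<?php
// Sample markup: two double-quoted ids and one unquoted id.
$html = '<div id="header"><span id="nav">menu</span><span id=footer>x</span></div>';

// Same pattern as getIds(): captures the value of double-quoted id attributes.
preg_match_all('/(id="(\w*)")/is', $html, $patterns);

print_r($patterns[2]); // header and nav; the unquoted id=footer is missed
```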
Comments
The cURL library allows you to download web pages. You should look into regular expressions (or a DOM parser) for doing the scraping.