What built-in PHP functions are useful for web scraping? What are some good resources (web or print) for getting up to speed on web scraping with PHP?
-
I'd like to recommend this class I recently came across: Simple HTML DOM Parser. – Krzysztof Prugar, Apr 21, 2009
-
PHP is a particularly bad language for this. It lacks an event-driven framework, which is almost necessary for this task. Can you crawl one site with it? Yes. Will you ever crawl a lot of sites well? No. – Evan Carroll, Aug 30, 2010
-
@EvanCarroll Will cURL and DOMDocument be suitable for scraping the price and image of products from multiple websites (to output on my website)? If not, what would you suggest? – stadisco, Jun 19, 2015
-
Just try it; if it works, it's good enough for you. Node is a much better choice for building a web scraper. Also PhantomJS, if you need something modern that actually has a DOM and runs the JavaScript on it. – Evan Carroll, Jun 22, 2015
9 Answers
Scraping generally encompasses 3 steps:
- first you GET or POST your request to a specified URL
- next you receive the HTML that is returned as the response
- finally you parse out of that HTML the text you'd like to scrape
To accomplish steps 1 and 2, below is a simple PHP class which uses cURL to fetch web pages with either GET or POST. After you get the HTML back, you use regular expressions to accomplish step 3 by parsing out the text you'd like to scrape.
For regular expressions, my favorite tutorial site is the following: Regular Expressions Tutorial
My favorite program for working with regexes is RegexBuddy. I would advise you to try the demo of that product even if you have no intention of buying it. It is an invaluable tool and will even generate code for the regexes you make in your language of choice (including PHP).
Usage:
$curl = new Curl();
$html = $curl->get("http://www.google.com");
// now, do your regex work against $html
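For step 3, here is a minimal sketch of the regex pass. The HTML is an inline sample so the snippet is self-contained; in practice you'd run the same `preg_match_all()` against the `$html` returned by `get()`:

```php
<?php
// Sample markup standing in for a fetched page.
$html = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>';

// Capture the href and the link text of every anchor tag.
preg_match_all('/<a href="([^"]+)">([^<]+)<\/a>/', $html, $matches);

print_r($matches[1]); // the hrefs: /a and /b
print_r($matches[2]); // the link texts: First and Second
```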
PHP Class:
<?php
class Curl
{
    public $cookieJar = "";
    public $curl; // cURL handle, (re)created per request

    public function __construct($cookieJarFile = 'cookies.txt')
    {
        $this->cookieJar = $cookieJarFile;
    }

    // Common options for every request: browser-like headers,
    // cookie persistence, and automatic redirect following.
    function setup()
    {
        $header = array();
        $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
        $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
        $header[] = "Cache-Control: max-age=0";
        $header[] = "Connection: keep-alive";
        $header[] = "Keep-Alive: 300";
        $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
        $header[] = "Accept-Language: en-us,en;q=0.5";
        $header[] = "Pragma: "; // browsers keep this blank

        curl_setopt($this->curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7');
        curl_setopt($this->curl, CURLOPT_HTTPHEADER, $header);
        curl_setopt($this->curl, CURLOPT_COOKIEJAR, $this->cookieJar);
        curl_setopt($this->curl, CURLOPT_COOKIEFILE, $this->cookieJar);
        curl_setopt($this->curl, CURLOPT_AUTOREFERER, true);
        curl_setopt($this->curl, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($this->curl, CURLOPT_RETURNTRANSFER, true);
    }

    function get($url)
    {
        $this->curl = curl_init($url);
        $this->setup();
        return $this->request();
    }

    // Convenience wrapper: returns the first capture group of every match.
    function getAll($reg, $str)
    {
        preg_match_all($reg, $str, $matches);
        return $matches[1];
    }

    function postForm($url, $fields, $referer = '')
    {
        $this->curl = curl_init($url);
        $this->setup();
        curl_setopt($this->curl, CURLOPT_URL, $url);
        curl_setopt($this->curl, CURLOPT_POST, 1);
        curl_setopt($this->curl, CURLOPT_REFERER, $referer);
        curl_setopt($this->curl, CURLOPT_POSTFIELDS, $fields);
        return $this->request();
    }

    function getInfo($info)
    {
        return ($info == 'lasturl')
            ? curl_getinfo($this->curl, CURLINFO_EFFECTIVE_URL)
            : curl_getinfo($this->curl, $info);
    }

    function request()
    {
        return curl_exec($this->curl);
    }
}
?>
4 Comments
- Note that get() and postForm() each call $this->curl = curl_init($url), which opens a new cURL session per request; the CURLOPT_COOKIEJAR / CURLOPT_COOKIEFILE options set in setup() are what persist cookies across those sessions.
- There was a typo here: $cookieJar should read $this->cookieJar.
I recommend Goutte, a simple PHP Web Scraper.
Example usage:
Create a Goutte Client instance (which extends Symfony\Component\BrowserKit\Client):
use Goutte\Client;
$client = new Client();
Make requests with the request() method:
$crawler = $client->request('GET', 'http://www.symfony-project.org/');
The request method returns a Crawler object (Symfony\Component\DomCrawler\Crawler).
Click on links:
$link = $crawler->selectLink('Plugins')->link();
$crawler = $client->click($link);
Submit forms:
$form = $crawler->selectButton('sign in')->form();
$crawler = $client->submit($form, array('signin[username]' => 'fabien', 'signin[password]' => 'xxxxxx'));
Extract data:
$nodes = $crawler->filter('.error_list');
if ($nodes->count()) {
    die(sprintf("Authentification error: %s\n", $nodes->text()));
}
printf("Nb tasks: %d\n", $crawler->filter('#nb_tasks')->text());
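If you'd rather stick to built-in PHP, the DOM extension's DOMDocument and DOMXPath classes cover the same extraction step without a third-party crawler. A rough equivalent of the two filter() calls above, run against inline sample markup so it's self-contained:

```php
<?php
// Sample markup standing in for a downloaded page.
$html = '<div><p class="error_list">Bad login</p><p id="nb_tasks">7</p></div>';

$doc = new DOMDocument();
// The @ suppresses warnings loadHTML emits for imperfect real-world HTML.
@$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

// Equivalent of $crawler->filter('.error_list')
$errors = $xpath->query('//p[@class="error_list"]');
echo $errors->item(0)->textContent, "\n"; // Bad login

// Equivalent of $crawler->filter('#nb_tasks')
echo $xpath->query('//*[@id="nb_tasks"]')->item(0)->textContent, "\n"; // 7
```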
Comments
ScraperWiki is a pretty interesting project. It helps you build scrapers online in Python, Ruby or PHP; I was able to get a simple attempt up in a few minutes.
1 Comment
If you need something that is easy to maintain, rather than fast to execute, it could help to use a scriptable browser, such as SimpleTest's.
Comments
I'd either use libcurl or Perl's LWP (libwww for Perl). Is there a libwww for PHP?
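There is no direct LWP port, but PHP's built-in HTTP stream wrapper covers the simple cases: file_get_contents() accepts a stream context carrying the method, headers, and a timeout, so plain GET works without curl. A minimal sketch; the commented-out line shows the actual fetch, with the URL being whatever you're scraping:

```php
<?php
// Describe the request in a stream context; no network I/O happens here.
$context = stream_context_create([
    'http' => [
        'method'  => 'GET',
        'header'  => "Accept-Language: en\r\nUser-Agent: MyScraper/1.0\r\n",
        'timeout' => 10,
    ],
]);

// The fetch itself would be:
// $html = file_get_contents('http://www.example.com/', false, $context);

// Inspect the options we just set.
$opts = stream_context_get_options($context);
echo $opts['http']['method'], "\n"; // GET
```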
2 Comments
Scraper class from my framework:
<?php
/*
Example:
$site = $this->load->cls('scraper', 'http://www.anysite.com');
$excss = $site->getExternalCSS();
$incss = $site->getInternalCSS();
$ids = $site->getIds();
$classes = $site->getClasses();
$spans = $site->getSpans();
print '<pre>';
print_r($excss);
print_r($incss);
print_r($ids);
print_r($classes);
print_r($spans);
*/
class scraper
{
    // Despite the name, this property holds the downloaded page HTML.
    private $url = '';

    public function __construct($url)
    {
        $this->url = file_get_contents($url);
    }

    // Each getter returns array(matches, match count).
    public function getInternalCSS()
    {
        preg_match_all('/(style=")(.*?)(")/is', $this->url, $patterns);
        return array($patterns[2], count($patterns[2]));
    }

    public function getExternalCSS()
    {
        preg_match_all('/(href=")(\w.*\.css)"/i', $this->url, $patterns);
        return array($patterns[2], count($patterns[2]));
    }

    public function getIds()
    {
        preg_match_all('/(id="(\w*)")/is', $this->url, $patterns);
        return array($patterns[2], count($patterns[2]));
    }

    public function getClasses()
    {
        preg_match_all('/(class="(\w*)")/is', $this->url, $patterns);
        return array($patterns[2], count($patterns[2]));
    }

    public function getSpans()
    {
        preg_match_all('/(<span>)(.*)(<\/span>)/', $this->url, $patterns);
        return array($patterns[2], count($patterns[2]));
    }
}
?>
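Note that the patterns above only match double-quoted attributes, so this kind of regex scraping is fragile. You can check what a pattern such as the one in getIds() actually captures by running it against a sample string directly:

```php
<?php
// Sample markup: two double-quoted ids and one unquoted id.
$html = '<div id="header"><span id="nav">menu</span><span id=footer>x</span></div>';

// Same pattern as getIds(): captures the value of double-quoted id attributes.
preg_match_all('/(id="(\w*)")/is', $html, $patterns);

print_r($patterns[2]); // header and nav; the unquoted id=footer is missed
```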
Comments
The cURL library allows you to download web pages. You should look into regular expressions (or a DOM parser) for doing the scraping.