Google Search can be parsed with BeautifulSoup web scraping library without selenium, since the data is not being loaded dynamically via JavaScript, and will execute much faster in comparison to selenium as there's no need to render the page and use browser.
In order to get information from all pages, you can use pagination using an infinite while loop. Try to avoid using for i in range() pagination as it is a hardcoded way of doing pagination thus not reliable. If the page number would change (from 5 to 20), pagination will be broken.
Since the while loop is infinite, you need to set the conditions for exiting it, you can make two conditions:
- the exit condition will be the presence of a button to switch to the next page (it is not on the last page), the presence can be checked by its CSS selector (in our case - ".d6cvqb a[id=pnnext]")
# condition for exiting the loop in the absence of the next page button
if soup.select_one(".d6cvqb a[id=pnnext]"):
params["start"] += 10
else:
break
- another solution would be to add a limit of pages available for scraping if there is no need to extract all the pages.
# condition for exiting the loop when the page limit is reached
if page_num == page_limit:
break
When trying to request a site, it may think that this is a bot, so that this does not happen, you need to send headers that contain user-agent in the request, then the site will assume that you are a user and display the information.
Next step could be to rotate user-agent, for example, to switch between PC, mobile, and tablet, as well as between browsers e.g. Chrome, Firefox, Safari, Edge and so on. The most reliable way is to use rotating proxies, user-agents, and a captcha solver.
Check full code in the online IDE.
from bs4 import BeautifulSoup
import requests, json, lxml
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
"q": "cars", # query example
"hl": "en", # language
"gl": "uk", # country of the search, UK -> United Kingdom
"start": 0, # number page by default up to 0
#"num": 100 # parameter defines the maximum number of results to return.
}
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}
page_limit = 10 # page limit for example
page_num = 0
data = []
while True:
page_num += 1
print(f"page: {page_num}")
html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select(".tF2Cxc"):
title = result.select_one(".DKV0Md").text
try:
snippet = result.select_one(".lEBKkf span").text
except:
snippet = None
links = result.select_one(".yuRUbf a")["href"]
data.append({
"title": title,
"snippet": snippet,
"links": links
})
# condition for exiting the loop when the page limit is reached
if page_num == page_limit:
break
# condition for exiting the loop in the absence of the next page button
if soup.select_one(".d6cvqb a[id=pnnext]"):
params["start"] += 10
else:
break
print(json.dumps(data, indent=2, ensure_ascii=False))
Example output:
[
{
"title": "Cars (2006) - IMDb",
"snippet": "On the way to the biggest race of his life, a hotshot rookie race car gets stranded in a rundown town, and learns that winning isn't everything in life.",
"links": "https://www.imdb.com/title/tt0317219/"
},
{
"title": "Cars (film) - Wikipedia",
"snippet": "Cars is a 2006 American computer-animated sports comedy film produced by Pixar Animation Studios and released by Walt Disney Pictures. The film was directed ...",
"links": "https://en.wikipedia.org/wiki/Cars_(film)"
},
{
"title": "Cars - Rotten Tomatoes",
"snippet": "Cars offers visual treats that more than compensate for its somewhat thinly written story, adding up to a satisfying diversion for younger viewers.",
"links": "https://www.rottentomatoes.com/m/cars"
},
other results ...
]
Also you can use Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
The difference is that it will bypass blocks (including CAPTCHA) from Google, no need to create the parser and maintain it.
Code example:
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os
params = {
"api_key": "...", # serpapi key from https://serpapi.com/manage-api-key
"engine": "google", # serpapi parser engine
"q": "cars", # search query
"gl": "uk", # country of the search, UK -> United Kingdom
"num": "100" # number of results per page (100 per page in this case)
# other search parameters: https://serpapi.com/search-api#api-parameters
}
search = GoogleSearch(params) # where data extraction happens
page_limit = 10
organic_results_data = []
page_num = 0
while True:
results = search.get_dict() # JSON -> Python dictionary
page_num += 1
for result in results["organic_results"]:
organic_results_data.append({
"title": result.get("title"),
"snippet": result.get("snippet"),
"link": result.get("link")
})
if page_num == page_limit:
break
if "next_link" in results.get("serpapi_pagination", []):
search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
else:
break
print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))
Output:
[
{
"title": "Rally Cars - Page 30 - Google Books result",
"snippet": "Some people say rally car drivers are the most skilled racers in the world. Roger Clark, a British rally legend of the 1970s, describes sliding his car down ...",
"link": "https://books.google.co.uk/books?id=uIOlAgAAQBAJ&pg=PA30&lpg=PA30&dq=cars&source=bl&ots=9vDWFi0bHD&sig=ACfU3U1d4R-ShepjsTtWN-b9SDYkW1sTDQ&hl=en&sa=X&ved=2ahUKEwjPv9axu_b8AhX9LFkFHbBaB8c4yAEQ6AF6BAgcEAM"
},
{
"title": "Independent Sports Cars - Page 5 - Google Books result",
"snippet": "The big three American auto makers produced sports and sports-like cars beginning with GMs Corvette and Fords Thunderbird in 1954. Folowed by the Mustang, ...",
"link": "https://books.google.co.uk/books?id=HolUDwAAQBAJ&pg=PA5&lpg=PA5&dq=cars&source=bl&ots=yDaDtQSyW1&sig=ACfU3U11nHeRTwLFORGMHHzWjaVHnbLK3Q&hl=en&sa=X&ved=2ahUKEwjPv9axu_b8AhX9LFkFHbBaB8c4yAEQ6AF6BAgaEAM"
}
other results...
]