
I'm trying to create a web scraper that will get links from a Google search results page. Everything works fine, but I want to search a specific site only, i.e., instead of test I want to search for site:example.com test. The following is my current code:

import requests
from bs4 import BeautifulSoup

s_term = input("Enter search term: ").replace(" ", "+")
r = requests.get(
    "http://www.google.com/search",
    params={"q": '"' + s_term + '"', "num": "50", "tbs": "li:1"},
)

soup = BeautifulSoup(r.content, "html.parser")

links = []
for item in soup.find_all("h3", attrs={"class": "r"}):
    links.append(item.a["href"])

print(links)

I tried using: ...params={'q':'"site%3Aexample.com+'+s_term+'"'... but it returns 0 results.
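For reference, requests percent-encodes each value in params by itself, which is why the pre-encoded attempt fails: the %3A gets encoded a second time. A quick sketch (the search term is illustrative) comparing the URLs that are actually sent:

```python
import requests

# Pre-encoded value: requests encodes the "%" again ("%3A" -> "%253A"),
# so Google literally searches for the text "site%3Aexample.com".
bad = requests.Request(
    "GET", "http://www.google.com/search",
    params={"q": '"site%3Aexample.com+test"'},
).prepare()
print(bad.url)   # ...?q=%22site%253Aexample.com%2Btest%22

# Raw value: requests encodes it exactly once.
good = requests.Request(
    "GET", "http://www.google.com/search",
    params={"q": "site:example.com test"},
).prepare()
print(good.url)  # ...?q=site%3Aexample.com+test
```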

  • r.status_code == 503? Google doesn't like bots or Google dorks. Commented Aug 19, 2017 at 19:24
  • No, it's returning 200. But it displays: Your search - "site%3Atwitter.com+Manikiran" - did not match any documents. Commented Aug 19, 2017 at 19:30

2 Answers


Change your existing params to the ones below:

params={"source":"hp","q":"site:example.com test","oq":"site:example.com test","gs_l":"psy-ab.12...10773.10773.0.22438.3.2.0.0.0.0.135.221.1j1.2.0....0...1.2.64.psy-ab..1.1.135.6..35i39k1.zWoG6dpBC3U"}
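Most of those extra fields (source, oq, gs_l) are just what a browser session happens to send along; the essential change is passing the raw site:example.com test string and letting the client encode it. A stdlib sketch of what that query serializes to on the wire:

```python
from urllib.parse import urlencode

# Raw query string; urlencode (which requests also uses under the hood)
# turns ":" into "%3A" and spaces into "+" exactly once.
params = {"q": "site:example.com test", "num": "50"}
print("http://www.google.com/search?" + urlencode(params))
# http://www.google.com/search?q=site%3Aexample.com+test&num=50
```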



You only need the "q" param. Also, make sure you're using a user-agent, because Google might block your requests eventually and you'll receive completely different HTML.

Pass params:

params = {
  "q": "site:example.com test"
}

requests.get("YOUR_URL", params=params)

Pass user-agent:

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get("YOUR_URL", headers=headers)

Code and full example:

from bs4 import BeautifulSoup
import requests

headers = {
    "User-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "site:example.com test"
}

html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, "lxml")  # requires the lxml package

for result in soup.select(".tF2Cxc"):
    link = result.select_one(".yuRUbf a")["href"]
    print(link)

# http://example.com/

Alternatively, you can do the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you don't have to figure out how to make things work, since that's already done for the end user; the only thing you need to do is iterate over the structured JSON and pick out what you want.

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "site:example.com test",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(result["link"])

# http://example.com/

Disclaimer, I work for SerpApi.

