Error while parsing Google search results using urllib in Python

Question

I have started learning web scraping in Python using urllib and bs4.

While looking for code to analyze, I found this answer: https://stackoverflow.com/a/38620894/14252018. Here is the code:

from urllib.parse import urlparse, parse_qs

from lxml.html import fromstring
from requests import get

raw = get("https://www.google.com/search?q=StackOverflow").text
page = fromstring(raw)

for result in page.cssselect(".r a"):
    url = result.get("href")
    if url.startswith("/url?"):
        # parse_qs returns a list per key; take the first value
        url = parse_qs(urlparse(url).query)['q'][0]
    print(url)

When I run this, it does not print anything.

So I then tried bs4, this time with https://www.duckduckgo.com,

and changed the code to this:

import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('https://duckduckgo.com/?q=dinosaur&t=h_&ia=web').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

print(soup.get_text())

I got an error.

Why didn't the first block of code print anything?
Why did the second block of code give me an error, and what does that error mean?

Perhaps try cssselect(".r.a") if you're searching for elements with class="r a" or class="a r".
– user5386938, Commented Sep 11, 2020 at 14:22
And why did the second block of code give an error, and what does that mean?
– Praveen, Commented Sep 11, 2020 at 14:30
Why do you assume that the DuckDuckGo message was an error? The message just shows that DuckDuckGo detected that JavaScript is not supported and is redirecting you to a different page.
– user5386938, Commented Sep 11, 2020 at 14:33
What else did you expect the second block of code to print?
– user5386938, Commented Sep 11, 2020 at 14:36

user5386938 · Accepted Answer · 2020-09-11 15:29:12Z


Change your DuckDuckGo URL to the one the site redirects you to when JavaScript is not enabled.

import bs4 as bs
import urllib.request

# url = 'https://duckduckgo.com/?q=dinosaur&t=h_&ia=web' # uses javascript
url = 'https://html.duckduckgo.com/html?q=dinosaur' # no javascript

sauce = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(sauce, 'lxml')

print(soup.get_text())
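If the search term is not fixed, the same no-JavaScript URL can be built with urlencode so that spaces and special characters are escaped. This is only a sketch: the helper names build_search_url and fetch are hypothetical, and the User-Agent header is an assumption (some sites reject the default urllib one; whether DuckDuckGo does is not confirmed here).

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# No-JavaScript endpoint used in the code above
BASE_URL = "https://html.duckduckgo.com/html"


def build_search_url(query):
    """Return the no-JavaScript search URL for an arbitrary query (hypothetical helper)."""
    return BASE_URL + "?" + urlencode({"q": query})


def fetch(url, timeout=10):
    """Download a page and return its raw bytes (requires network access)."""
    # Assumption: a browser-like User-Agent may be needed; adjust or drop as required.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return response.read()


url = build_search_url("dinosaur")
print(url)
```

The bytes returned by fetch(url) can then be passed to bs.BeautifulSoup(..., 'lxml') exactly as before.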


1 Comment

user5386938 Over a year ago

The first block printed nothing because nothing matched your CSS selector. Google serves different pages depending on whether JavaScript is enabled, and neither urllib nor requests executes JavaScript.
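One way to avoid depending on a class name that may not exist in the page you actually receive is to collect every link and keep only those shaped like /url?q=<target>, which is the form the question's code expects. The sketch below uses only the standard library; LinkCollector and extract_result_urls are hypothetical names, and SAMPLE is made-up markup, not real Google output. Whether Google's no-JavaScript page still uses this link form is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, regardless of its class."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_result_urls(html):
    """Return the targets of links shaped like /url?q=<target>."""
    collector = LinkCollector()
    collector.feed(html)
    targets = []
    for href in collector.hrefs:
        if href.startswith("/url?"):
            # parse_qs maps each key to a list of values
            values = parse_qs(urlparse(href).query).get("q")
            if values:
                targets.append(values[0])
    return targets


# Hypothetical sample markup for illustration only
SAMPLE = """
<div class="g"><a href="/url?q=https://stackoverflow.com/&amp;sa=U">Stack Overflow</a></div>
<div class="g"><a href="/search?q=StackOverflow&amp;tbm=isch">Images</a></div>
"""

urls = extract_result_urls(SAMPLE)
print(len(urls), urls)
```

Printing the number of matches before looping makes an empty result visible instead of silently printing nothing.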
