I started learning web scraping in Python using urllib and bs4.

While looking for code to analyze, I found this answer: https://stackoverflow.com/a/38620894/14252018. Here is the code:

from urllib.parse import urlencode, urlparse, parse_qs

from lxml.html import fromstring
from requests import get

raw = get("https://www.google.com/search?q=StackOverflow").text
page = fromstring(raw)

for result in page.cssselect(".r a"):
    url = result.get("href")
    if url.startswith("/url?"):
        url = parse_qs(urlparse(url).query)['q']
    print(url[0])

When I run this, it does not print anything.

I named the file webparse.py.

I then tried bs4, this time with https://www.duckduckgo.com, and changed the code to this:

import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('https://duckduckgo.com/?q=dinosaur&t=h_&ia=web').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

print(soup.get_text())

I got an error:

  1. Why didn't the first block of code print anything?
  2. Why did the second block of code give me an error, and what does that error mean?
  • Perhaps try cssselect(".r.a") if you're searching for elements with class="r a" or class="a r" Commented Sep 11, 2020 at 14:22
  • And why did the second block of code give an error, and what does that mean? Commented Sep 11, 2020 at 14:30
  • Why do you assume that the duckduckgo message was an error? The message just shows that duckduckgo detected that javascript is not understood and that duckduckgo is redirecting you to a different page. Commented Sep 11, 2020 at 14:33
  • But it did not print anything other than that Commented Sep 11, 2020 at 14:35
  • What else did you expect the 2nd block of code to print out? Commented Sep 11, 2020 at 14:36

1 Answer

Change your DuckDuckGo URL to the one the site redirects you to when JavaScript is not enabled.

import bs4 as bs
import urllib.request

# url = 'https://duckduckgo.com/?q=dinosaur&t=h_&ia=web' # uses javascript
url = 'https://html.duckduckgo.com/html?q=dinosaur' # no javascript

sauce = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(sauce, 'lxml')

print(soup.get_text())
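
If you want only the result links rather than all of the page text, something along these lines may help. It is a minimal sketch that uses the standard library's html.parser so it can be tried without a network call. The class name `result__a`, the names `ResultLinkParser` and `extract_links`, and the sample markup are assumptions made for illustration, so inspect the HTML you actually receive and adjust.

```python
from html.parser import HTMLParser


class ResultLinkParser(HTMLParser):
    """Collect (title, href) pairs from <a> tags that carry a given CSS class."""

    def __init__(self, css_class="result__a"):  # assumed class name
        super().__init__()
        self.css_class = css_class
        self.links = []
        self._in_link = False  # True while inside a matching <a>
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # The class attribute is a space-separated list of class names.
        classes = (attrs.get("class") or "").split()
        if self.css_class in classes:
            self._in_link = True
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.links.append(("".join(self._text).strip(), self._href))
            self._in_link = False


def extract_links(html, css_class="result__a"):
    """Return a list of (title, href) tuples found in the HTML string."""
    parser = ResultLinkParser(css_class)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup standing in for the downloaded page.
SAMPLE = """
<div class="result">
  <a class="result__a" href="https://example.com/dino">Dinosaur - Example</a>
  <a class="other" href="https://example.com/skip">Skip me</a>
</div>
"""

for title, href in extract_links(SAMPLE):
    print(title, href)
```

To use it on the real page, pass the downloaded HTML as a string, for example `extract_links(sauce.decode("utf-8"))`. With bs4, the rough equivalent under the same class-name assumption would be `soup.select("a.result__a")`.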


1 Comment

Your first block printed nothing because nothing matched your CSS selector. Google serves different pages depending on whether JavaScript is enabled, and neither urllib nor requests runs JavaScript.
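
A silent run is what an empty match list looks like: the body of the for loop never executes, so nothing is printed. The sketch below makes that case visible. It also handles hrefs that are not `/url?` redirects, where the question's `print(url[0])` would print only the first character of the string. The helper name `unwrap_redirect` and the sample hrefs are hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def unwrap_redirect(href):
    """Return the target of a '/url?q=...' redirect link, else href unchanged."""
    if href and href.startswith("/url?"):
        targets = parse_qs(urlparse(href).query).get("q")
        if targets:
            return targets[0]
    return href


# Hypothetical hrefs, as they might come out of result.get("href").
hrefs = ["/url?q=https://stackoverflow.com/&sa=U", "https://example.com/"]
for href in hrefs:
    print(unwrap_redirect(href))

# An unmatched selector yields an empty list, so report it explicitly.
matches = []
if not matches:
    print("Selector matched nothing; check the HTML the server returned.")
```

Printing or saving the raw response text once is usually the quickest way to see which markup the server sent.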
