I am trying to crawl rugdoc.io. When I open the site manually, I first see a page that says "Checking your browser. Just wait a moment." After a second or so, the actual content is displayed. When I open it with Selenium, I stay on the "wait a moment" page indefinitely.
How can this work manually but not with Selenium? How can rugdoc.io tell that the page is being accessed automatically? Does Selenium launch Chrome with some extra options? My code:

import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.binary_location = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
# Selenium 4 deprecates executable_path= and chrome_options= in favour of service= and options=
service = Service("/Users/lukas.denk/Downloads/chromedriver")
driver = webdriver.Chrome(service=service, options=options)

driver.get("https://rugdoc.io/")
time.sleep(10)

# Still the "just wait a moment" page
loaded_webpage_should_be_here = driver.page_source

Chrome version: 100.0.4896.127 (arm64).
ChromeDriver version: 100.0.4896.60.
macOS: 12.3.1 - with M1 Max.
Selenium version: 4.1.3.
Python version: 3.8

EDIT: This may have something to do with Selenium having problems with pages that redirect to another page (see e.g., here). When I visit rugdoc.io, it seems to redirect me to https://rugdoc.io/?__cf_chl_tk=hkaULMeBxwgnTv0SgwmOY62fuDatlRLnupbDymXWWs0-1650454179-0-gaNycGzNBpE and then back to rugdoc.io.
However, the solution in the linked Stack Overflow answer proposes a driver.navigate().to() function, which does not exist in the Python Selenium bindings.
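
For reference, the Python counterpart of Java's driver.navigate().to(url) is driver.get(url), so no navigation call is missing here. What can be improved is the fixed time.sleep(10): the rough sketch below polls the page source until the interstitial text quoted above disappears. The helper name wait_until_challenge_clears and its parameters are made up for illustration, and the marker string is an assumption taken from the text I see on the page. It only reports whether the page got past the check; it does not bypass anything.

```python
import time


def wait_until_challenge_clears(driver, marker="Checking your browser",
                                timeout=30.0, poll=0.5):
    """Poll driver.page_source until `marker` is gone or `timeout` seconds pass.

    `driver` can be any object with a `page_source` attribute, such as a
    Selenium WebDriver. Returns True once the marker is absent, False if it
    is still present when the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while True:
        if marker not in driver.page_source:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

Calling wait_until_challenge_clears(driver) right after driver.get("https://rugdoc.io/") returns False in my case, which at least confirms that the page never leaves the interstitial.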

1 Answer

Had to run your code to understand your problem :(

The issue you are running into is Cloudflare's DDoS protection, which blocks requests coming from WebDriver-controlled browsers in order to protect the site against automated traffic and DDoS attacks :)

You can try this ChromeDriver alternative, which is patched to avoid triggering those checks: https://github.com/ultrafunkamsterdam/undetected-chromedriver
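
A minimal sketch of how that package is typically used, assuming it is installed with pip install undetected-chromedriver and can locate a matching Chrome on its own. The helper fetch_page_source and its driver_factory and settle_seconds parameters are names I made up for illustration; only uc.Chrome() comes from the library. Whether it actually gets past the check on a given site is not guaranteed.

```python
import time


def fetch_page_source(url, driver_factory=None, settle_seconds=10.0):
    """Open `url`, wait a fixed time for any interstitial, and return the HTML.

    `driver_factory` is a hypothetical hook: any zero-argument callable that
    returns a WebDriver-like object. By default it is undetected-chromedriver's
    Chrome class, imported lazily so this module loads without the package.
    """
    if driver_factory is None:
        import undetected_chromedriver as uc  # third-party package
        driver_factory = uc.Chrome

    driver = driver_factory()
    try:
        driver.get(url)
        time.sleep(settle_seconds)  # same crude wait as in the question
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

Usage would then be html = fetch_page_source("https://rugdoc.io/"); the rest of your scraping code can stay as it is, since the returned driver exposes the usual Selenium API.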

2 Comments

But do you know how it detects that I am crawling with webdriver? Since webdriver just opens the webpage once, there is no reason for the webdriver to behave differently from a real person. At least in theory.
There are several ways that this detection occurs. Check this blog post for some examples and how to (manually) bypass them: piprogramming.org/articles/…
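
One signal that is easy to check yourself is the navigator.webdriver property, which browsers set to true when they are driven through WebDriver. The sketch below reads it from the running session; is_flagged_as_automated is a hypothetical helper name, and this is only one of several possible signals, not necessarily the one Cloudflare relies on.

```python
def is_flagged_as_automated(driver):
    """Return True if the page's JavaScript sees navigator.webdriver as truthy.

    `driver` is any object with Selenium's execute_script(script) method.
    A missing or false property (as in a normally launched browser) gives False.
    """
    return bool(driver.execute_script("return navigator.webdriver"))
```

With the plain Selenium setup from the question, is_flagged_as_automated(driver) should come back True, whereas typing navigator.webdriver into the DevTools console of a manually opened Chrome shows false.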
