11

I am a beginner in python,im just trying to scrape web with module requests and BeautifulSoup This Website i make request.

and my simple code:

import requests, time, re, json
from bs4 import BeautifulSoup as BS

url = "https://www.jobstreet.co.id/en/job-search/job-vacancy.php?ojs=6"

def list_jobs():
    try:
        with requests.session() as s:
            st = time.time()
            s.headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0'}
            req = s.get(url)
            soup = BS(req.text,'html.parser')
            attr = soup.findAll('div',class_='position-title header-text')
            pttr = r".?(.*)Rank=\d+"
            lists = {"status":200,"result":[]}
            for a in attr:
                sr = re.search(pttr, a.find("a")["href"])
                if sr:
                    title = a.find('a')['title'].replace("Lihat detil lowongan -","").replace("\r","").replace("\n","")
                    url = a.find('a')['href']
                    lists["result"].append({
                        "title":title,
                        "url":url,
                        "detail":detail_jobs(url)
                    })
            print(json.dumps(lists, indent=4))
            end = time.time() - st
            print(f"\n{end} second")
    except:
        pass

def detail_jobs(find_url):
    try:
        with requests.session() as s:
            s.headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0'}
            req = s.get(find_url)
            soup = BS(req.text,'html.parser')
            position = soup.find('h1',class_='job-position').text
            name = soup.find('div',class_='company_name').text.strip("\t")
            try:
                addrs = soup.find('div',class_='map-col-wraper').find('p',{'id':'address'}).text
            except Exception:
                addrs = "Unknown"
            try:
                loct = soup.find('span',{'id':'single_work_location'}).text
            except Exception:
                loct = soup.find('span',{'id':'multiple_work_location_list'}).find('span',{'class':'show'}).text        
            dests = soup.findAll('div',attrs={'id':'job_description'})
            for select in dests:
                txt = select.text if not select.text.startswith("\n") or not select.text.endswith("\n") else select.text.replace("\n","")
                result = {
                    "name":name,
                    "location":loct,
                    "position":position,
                    "description":txt,
                    "address":addrs
                }
                return result
    except:
        pass

they all work well but take very long to show results time is always above 13/17 seconds

i dont know how to increase my speed for requesting

I tried search on stack and google,they said using asyncio but the way so hard to me.

if someone have simple trick how to increase speed with simple do,im so appreciate ..

And sorry for my bad English

4 Answers 4

20

Learning Python through projects such as web scraping is awesome. That is how I was introduced to Python. That said, to increase the speed of your scraping, you can do three things:

  1. Change the html parser to something faster. 'html.parser' is the slowest of them all. Try change to 'lxml' or 'html5lib'. (read https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

enter image description here

  1. Drop the loops and regex as they slow your script. Just use BeautifulSoup tools, text and strip, and find the right tags.(see my script below)

  2. Since the bottleneck in web scraping is usually IO, waiting to get data from a webpage, using async or multithread will boost speed. In the below script, I have use multithreading. The aim is to pull data from multiple pages at the same time.

So if we know maximum number of pages, we can chunk our requests into different ranges and pull them in batches :)

Code example:

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

import requests
from bs4 import BeautifulSoup as bs

data = defaultdict(list)

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0'}

def get_data(data, headers, page=1):

    # Get start time
    start_time = datetime.now()
    url = f'https://www.jobstreet.co.id/en/job-search/job-vacancy/{page}/?src=20&srcr=2000&ojs=6'
    r = requests.get(url, headers=headers)

    # If the requests is fine, proceed
    if r.ok:
        jobs = bs(r.content,'lxml').find('div',{'id':'job_listing_panel'})
        data['title'].extend([i.text.strip() for i in jobs.find_all('div',{'class':'position-title header-text'})])
        data['company'].extend([i.text.strip() for i in jobs.find_all('h3',{'class':'company-name'})])
        data['location'].extend([i['title'] for i in jobs.find_all('li',{'class':'job-location'})] )
        data['desc'].extend([i.text.strip() for i in jobs.find_all('ul',{'class':'list-unstyled hidden-xs '})])
    else:
        print('connection issues')
    print(f'Page: {page} | Time taken {datetime.now()-start_time}')
    return data
    

def multi_get_data(data,headers,start_page=1,end_page=20,workers=20):
    start_time = datetime.now()
    # Execute our get_data in multiple threads each having a different page number
    with ThreadPoolExecutor(max_workers=workers) as executor:
        [executor.submit(get_data, data=data,headers=headers,page=i) for i in range(start_page,end_page+1)]
    
    print(f'Page {start_page}-{end_page} | Time take {datetime.now() -     start_time}')
    return data


# Test page 10-15
k = multi_get_data(data,headers,start_page=10,end_page=15)

Results: enter image description here

Explaining the multi_get_data function:

This function will call get_data function in different threads with passing desired arguments. At the moment, each thread get a different page number to call. The maximum numbers of workers is set to 20, meaning 20 threads. You can increase or decrease accordingly.

We have created variable data, a default dictionary, that takes lists in. All threads will populate this data. This variable can then be cast to json or Pandas DataFrame :)

As you can see, we have 5 requests, each taking less than 2 seconds but yet the total is still under 2 seconds;)

Enjoy web scraping.

Update_: 22/12/2019

We could also gain some speed by using session with a single headers update. So we don’t have to start sessions with each call.

from requests import Session

s = Session()
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '\
                         'AppleWebKit/537.36 (KHTML, like Gecko) '\
                         'Chrome/75.0.3770.80 Safari/537.36'}
# Add headers
s.headers.update(headers)

# we can use s as we do requests
# s.get(...)
...
Sign up to request clarification or add additional context in comments.

3 Comments

Excellent Prayson!
Its so amazing,i'll try it and improved it i want to know how to make a simple do request but with faster speed,maybe i can use your tricky ;)
Play with and tune it to fit your need. It gathered the same data in your example. I drop the printing of data as it’s slow the process. You can print your data afterwards e.g. import pandas as pd and add k in as df = pd.DataFrame(k). Now you can print print(df.head()) and load your data to csv as df.to_csv('data.csv', index=False,sep=';')
2

The bottleneck is the server responding slowly to a simple requests.

Try doing request parallel.

You can also use threads instead of asyncio. Here is a previous question explain for to parallerise tasks in Python:

Executing tasks in parallel in python

Please note that a smartly configured server would still slow your requests or ban you if you are scraping without a permission.

Comments

2

This is my suggestion to write your code with good architecture and divide it into functions and write a less code. This is one of example using request:

from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    text content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None


def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    content_type = resp.headers['Content-Type'].lower()
    return (resp.status_code == 200 
            and content_type is not None 
            and content_type.find('html') > -1)


def log_error(e):
    """
    It is always a good idea to log errors. 
    This function just prints them, but you can
    make it do anything.
    """
    print(e)

Debug your code on the points which take time and figure out them and discuss here. That way help you lot to solve your problem.

Comments

0

Try using scrapy it will handle website communication (request / response) for you.

If you make to many requests you will be blocked, they are using Cloudflare products

1 Comment

It would be nice to add a simple example on how to use scrapy ;)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.