I am currently writing Python code that scrapes information from the web. I have to scrape several sites, but there are two types of procedures:
- Scrape directly from the website
- Download a PDF and scrape it with regexes

I am considering the following three options; which one would be recommended?
Option 1: Use inheritance
```python
import requests
import PyPDF2
from bs4 import BeautifulSoup
import re


class Scraper:
    def __init__(self, name):
        self.name = name
        self.url = None

    def get_text_from_pdf(self, page_number):
        self.download_pdf(self.url, './data/{}.pdf'.format(self.name))
        with open('./data/{}.pdf'.format(self.name), 'rb') as mypdf:
            file_reader = PyPDF2.PdfFileReader(mypdf)
            page = file_reader.getPage(page_number)
            text = page.extractText()
        return text

    def download_pdf(self, url, path):
        response = requests.get(url)
        with open(path, 'wb') as f:
            f.write(response.content)

    def get_soup_from_url(self):
        response = requests.get(self.url)
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup


class Website1(Scraper):
    def __init__(self):
        super().__init__('website1')
        self.url = 'https://website1.com'

    def get_info(self, soup):
        '''
        Parse the HTML through BeautifulSoup
        '''


class Website2(Scraper):
    def __init__(self):
        super().__init__('website2')
        self.url = 'https://website2.com/some_pdf.pdf'

    def get_info(self, text):
        '''
        Parse the PDF text with regexes
        '''


if __name__ == "__main__":
    website1_scraper = Website1()
    raw_info = website1_scraper.get_soup_from_url()
    website1_scraper.get_info(raw_info)

    website2_scraper = Website2()
    raw_info = website2_scraper.get_text_from_pdf(page_number=0)
    website2_scraper.get_info(raw_info)

    # Website3_Scraper, 4, 5 ... 10
```
Option 2: Subclasses plus module-level functions
Only keep the subclasses Website1 and Website2, and convert the methods of the Scraper class into regular functions.
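Option 2 could look roughly like the sketch below. To keep it self-contained, the shared helper uses a regex where the real code would call BeautifulSoup, and `extract_title` and the sample HTML are invented for illustration:

```python
import re


def extract_title(html):
    """Shared module-level helper (was a Scraper method in Option 1).
    Stand-in for a BeautifulSoup call: pulls the <title> text out of raw HTML."""
    match = re.search(r'<title>(.*?)</title>', html, re.S)
    return match.group(1).strip() if match else None


class Website1:
    """Each site keeps only its own parsing logic; fetching and parsing
    helpers are plain functions shared by every site class."""
    url = 'https://website1.com'  # placeholder URL from the question

    def get_info(self, html):
        return extract_title(html)


print(Website1().get_info('<html><title>Hello</title></html>'))  # prints: Hello
```

The trade-off is that you lose the common base class, but each subclass only depends on the helpers it actually uses.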
Option 3: Functions only
Delete all classes and use only functions, such as get_info_from_website1() and get_info_from_website2().
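A minimal sketch of Option 3, with one plain function per site. The field name `Total` and the regex are made-up examples of the kind of pattern the PDF branch would use:

```python
import re


def get_info_from_website2(pdf_text):
    """One flat function per site: parse already-extracted PDF text with a
    regex. 'Total' is a hypothetical field, not from the real PDF."""
    match = re.search(r'Total:\s*(\d+)', pdf_text)
    return int(match.group(1)) if match else None


print(get_info_from_website2('Page 1 ... Total: 42 ...'))  # prints: 42
```

This is the simplest structure, at the cost of duplicating any fetch/download logic across the per-site functions unless you factor it out again.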