I am currently writing Python code that scrapes information from the web. I have to scrape several sites, but there are two types of procedures:
- Scrape directly from the website
- Download a PDF and scrape it with regexes

I am considering the following three options; which one would be recommended?
Option 1: Use inheritance
```python
import requests
import PyPDF2
from bs4 import BeautifulSoup
import re


class Scraper:
    def __init__(self, name):
        self.name = name
        self.url = None

    def get_text_from_pdf(self, page_number):
        self.download_pdf(self.url, './data/{}.pdf'.format(self.name))
        with open('./data/{}.pdf'.format(self.name), 'rb') as mypdf:
            file_reader = PyPDF2.PdfFileReader(mypdf)
            page = file_reader.getPage(page_number)
            text = page.extractText()
        return text

    def download_pdf(self, url, path):
        response = requests.get(url)
        with open(path, 'wb') as f:
            f.write(response.content)

    def get_soup_from_url(self):
        response = requests.get(self.url)
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup


class Website1(Scraper):
    def __init__(self):
        super().__init__('website1')
        self.url = 'https://website1.com'

    def get_info(self, soup):
        '''
        Parse the HTML through BeautifulSoup
        '''


class Website2(Scraper):
    def __init__(self):
        super().__init__('website2')
        self.url = 'https://website2.com/some_pdf.pdf'

    def get_info(self, text):
        '''
        Parse the PDF text with regexes
        '''


if __name__ == "__main__":
    website1_scraper = Website1()
    raw_info = website1_scraper.get_soup_from_url()
    website1_scraper.get_info(raw_info)

    website2_scraper = Website2()
    raw_info = website2_scraper.get_text_from_pdf(page_number=0)
    website2_scraper.get_info(raw_info)

    # Website3_Scraper, 4, 5 ... 10
```
Option 2: Subclasses plus module-level functions
Only keep the subclasses Website1 and Website2, and convert the methods of the Scraper class into regular functions.
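Option 2 could look roughly like the sketch below. To keep it self-contained, the shared helper uses a regex where the real code would call BeautifulSoup, and `extract_title` and the sample HTML are invented for illustration:

```python
import re


def extract_title(html):
    """Shared module-level helper (was a Scraper method in Option 1).
    Stand-in for a BeautifulSoup call: pulls the <title> text out of raw HTML."""
    match = re.search(r'<title>(.*?)</title>', html, re.S)
    return match.group(1).strip() if match else None


class Website1:
    """Each site keeps only its own parsing logic; fetching and parsing
    helpers are plain functions shared by every site class."""
    url = 'https://website1.com'  # placeholder URL from the question

    def get_info(self, html):
        return extract_title(html)


print(Website1().get_info('<html><title>Hello</title></html>'))  # prints: Hello
```

The trade-off is that you lose the common base class, but each subclass only depends on the helpers it actually uses.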
Option 3: Functions only
Delete all classes and use only functions, such as get_info_from_website1() and get_info_from_website2().
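A minimal sketch of Option 3, with one plain function per site. The field name `Total` and the regex are made-up examples of the kind of pattern the PDF branch would use:

```python
import re


def get_info_from_website2(pdf_text):
    """One flat function per site: parse already-extracted PDF text with a
    regex. 'Total' is a hypothetical field, not from the real PDF."""
    match = re.search(r'Total:\s*(\d+)', pdf_text)
    return int(match.group(1)) if match else None


print(get_info_from_website2('Page 1 ... Total: 42 ...'))  # prints: 42
```

This is the simplest structure, at the cost of duplicating any fetch/download logic across the per-site functions unless you factor it out again.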