How To Build a Web Scraper?
Before we dive into building a web scraper that helps us find the right jobs, we need to understand that not every website allows scraping. So we can ask: is web scraping legal? The answer is: it depends. If you decide to scrape data from a production website, be aware that too many requests can overload it and even crash it if the site is built for a small number of users. Be careful, because a simple exercise can accidentally turn into a denial-of-service (DoS) attack.
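One way to stay on the safe side is to read a site's robots.txt and pause between requests. The sketch below is only an illustration using Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs, and the helper names make_parser and polite_delay are assumptions made up for this example, and the rules are inlined so the code runs without touching the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text instead of downloading it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser: RobotFileParser, user_agent: str = "*", default: float = 1.0) -> float:
    """Return the Crawl-delay for this user agent, or a default pause in seconds."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


parser = make_parser(SAMPLE_ROBOTS_TXT)
allowed = parser.can_fetch("*", "https://example.com/jobs")          # path is not disallowed
blocked = parser.can_fetch("*", "https://example.com/private/data")  # path is disallowed
delay = polite_delay(parser)                                         # seconds to wait between requests
print(allowed, blocked, delay)
```

In a real scraper you would call time.sleep(delay) between requests and skip any URL for which can_fetch returns False.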
In our exercise we target a large website that serves lots of users, so we shouldn’t run into this problem. We use a dedicated library for this: Beautiful Soup, which parses HTML and makes it easy to pull out the data we need. Let’s take a look.
First of all, we need to install this library, together with requests, which the script also uses:
pip install beautifulsoup4 requests
Now, parsing the website is easy. Let’s create a script called scraper.py and import the packages we need:
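import requests as _requests
import bs4 as _bs4

def _generate_url_job(what: str, where: str, posted: int) -> str:
    # Assumed Indeed query format: q = keywords, l = location, fromage = maximum age in days
    return f"https://www.indeed.com/jobs?q={what}&l={where}&fromage={posted}"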
To fetch the page with the job listings, we use the requests library to send a GET request to the server:
def _get_page_job(url: str) -> _bs4.BeautifulSoup:
    # Download the page and parse its HTML
    page = _requests.get(url)
    soup = _bs4.BeautifulSoup(page.content, "html.parser")
    return soup
Then, let’s generate a URL to find jobs on indeed.com. I will be looking for remote Python jobs posted within the last 3 days.
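A slightly more defensive way to build such a URL is sketched below, using urlencode from the standard library so that spaces and special characters in the search terms are escaped. The helper name build_job_url is hypothetical, and the query parameters q, l, and fromage are an assumption about how Indeed's search URL is structured, so check them against the URL you see in your browser.

```python
from urllib.parse import urlencode


def build_job_url(what: str, where: str, posted: int) -> str:
    """Build a search URL; the parameter names are assumed, not confirmed."""
    params = {
        "q": what,          # search keywords (assumed parameter name)
        "l": where,         # location (assumed parameter name)
        "fromage": posted,  # maximum posting age in days (assumed parameter name)
    }
    return "https://www.indeed.com/jobs?" + urlencode(params)


url = build_job_url("python developer", "remote", 3)
print(url)  # https://www.indeed.com/jobs?q=python+developer&l=remote&fromage=3
```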
We will be looking for div elements with the class ‘job_seen_beacon’ in the HTML code. Then we need to inspect where each piece of information is hidden: the job title is in an h2, the company name is in a span with the class ‘companyName’, and so on.
def find_jobs(what, where, posted):
    url = _generate_url_job(what, where, posted)
    page = _get_page_job(url)
    # Each job card is a div with the class 'job_seen_beacon'
    jobs = page.find_all('div', class_='job_seen_beacon')
    job_dic = dict()
    for index, job in enumerate(jobs):
        job_title = job.find('h2').text
        company_name = job.find('span', class_='companyName').text.strip()
        # .text is a plain string, so split it once and keep the part after 'Posted'
        posted_date = job.find('span', class_='date').text.split('Posted')[-1].strip()
        more_info = 'https://indeed.com' + job.tbody.h2.a['href']
        job_dic[index] = job_title, company_name, posted_date, more_info
        print(job_dic[index])
    return job_dic
I left a print statement in the loop so you can check each job individually, with the job title, link, company name, and a little information about the job.
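To see how these selectors behave without hitting the live site, here is a minimal sketch that runs the same lookups on a hand-written HTML fragment. The fragment in SAMPLE_HTML is a made-up placeholder, not Indeed's actual markup, and parse_jobs is a hypothetical helper, so treat it as a way to experiment rather than as the exact structure of the real page.

```python
import bs4

# Hand-written placeholder markup that only mimics the class names used above.
SAMPLE_HTML = """
<div class="job_seen_beacon">
  <h2><a href="/viewjob?jk=abc123"><span>Python Engineer</span></a></h2>
  <span class="companyName">Example Corp</span>
  <span class="date">Posted 3 days ago</span>
</div>
"""


def parse_jobs(html: str) -> list[dict]:
    """Extract title, company, posting date, and link from each job card."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    results = []
    for card in soup.find_all("div", class_="job_seen_beacon"):
        heading = card.find("h2")
        date_text = card.find("span", class_="date").get_text(strip=True)
        results.append({
            "title": heading.get_text(strip=True),
            "company": card.find("span", class_="companyName").get_text(strip=True),
            # Keep only the part after the word 'Posted'
            "posted": date_text.split("Posted")[-1].strip(),
            "link": "https://indeed.com" + heading.a["href"],
        })
    return results


jobs = parse_jobs(SAMPLE_HTML)
print(jobs)
```

Working on a saved or hand-written fragment like this lets you adjust the selectors as often as you like without sending any requests to the server.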
Now it is enough to call the function and print the result:
print(find_jobs('python', 'remote', 3))
And we get a result, for example a Python Engineer position.
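If you want the output to be easier to read, one option is to format each entry on its own line. The sketch below assumes find_jobs returns the job_dic mapping of index to a (title, company, posted date, link) tuple. The helper format_jobs is hypothetical, and the sample entry is a made-up placeholder rather than a real listing.

```python
def format_jobs(jobs: dict) -> list[str]:
    """Turn the index -> (title, company, posted, link) mapping into readable lines."""
    lines = []
    for index, (title, company, posted, link) in sorted(jobs.items()):
        lines.append(f"{index + 1}. {title} at {company} ({posted}) -> {link}")
    return lines


# Placeholder data in the same shape as job_dic, not a real search result.
sample = {
    0: ("Python Engineer", "Example Corp", "3 days ago", "https://indeed.com/viewjob?jk=abc123"),
}

formatted = format_jobs(sample)
for line in formatted:
    print(line)
```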
As you can see, it is a very easy process, but it requires inspecting the HTML code and finding where the real information lives. As always, I wish you lots of luck and happy coding as you explore web scraping.