
How To Build a Web Scraper?

Before we dive into building a web scraper that finds the right jobs for us, we first need to understand that not every website should be scraped. Is web scraping even legal? The answer: it depends. If you decide to scrape a production website, be aware that your requests can overload it, and if the site is built for a small number of users, you can even crash it. Be careful: an easy exercise can accidentally turn into a denial-of-service (DoS) attack.

In this exercise we target a large website that serves many users, so we shouldn't run into that problem. We will use a dedicated library for this: Beautiful Soup, which lets us scrape the data cleanly. Let's take a look.

First of all, we need to install the libraries from the command line:

pip install beautifulsoup4 requests

Now, parsing the website is easy. Let's create a script, scraper.py, and import the packages we need:

import requests as _requests
import bs4 as _bs4

def _generate_url_job(what: str, where: str, posted: int) -> str:
    url = f"https://www.indeed.com/jobs?q={what}&l={where}&fromage={posted}"
    return url
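One caveat: if the search terms contain spaces (for example "python developer"), they must be URL-encoded before being interpolated into the query string. A minimal sketch using the standard library's urllib.parse (the function name here is my own, not part of the script above):

```python
from urllib.parse import urlencode

def generate_url_job_safe(what: str, where: str, posted: int) -> str:
    # urlencode escapes spaces and special characters in the query terms
    params = {"q": what, "l": where, "fromage": posted}
    return "https://www.indeed.com/jobs?" + urlencode(params)

print(generate_url_job_safe("python developer", "new york", 3))
```

Here "python developer" becomes "python+developer" in the query string, so the URL stays valid.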

To fetch the job page, we use the requests library to send a GET request to the server:

def _get_page_job(url: str) -> _bs4.BeautifulSoup:
    page = _requests.get(url)
    soup = _bs4.BeautifulSoup(page.content, "html.parser")
    return soup
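In practice, Indeed (like many large sites) may reject requests that do not look like they come from a browser. A defensive variant of the fetch function is sketched below; the User-Agent string is only an illustrative placeholder, and raise_for_status makes the script fail loudly instead of silently parsing an error page:

```python
import requests
import bs4

def get_page_job_checked(url: str) -> bs4.BeautifulSoup:
    # A browser-like User-Agent makes the request less likely to be rejected;
    # the exact header value here is only a placeholder
    headers = {"User-Agent": "Mozilla/5.0 (compatible; job-scraper-demo)"}
    page = requests.get(url, headers=headers, timeout=10)
    page.raise_for_status()  # raise on 4xx/5xx responses
    return bs4.BeautifulSoup(page.content, "html.parser")
```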

Then, let’s generate a URL to find jobs on indeed.com. I will look for Python jobs in remote locations posted within the last 3 days.

We will be looking for div elements with the class ‘job_seen_beacon’ in the HTML. Then we inspect where our information is hidden: the job title is in an h2 tag, the company name in a span with class ‘companyName’, and so on.
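Before running against the live site, this lookup pattern can be tried on a static snippet. The class names below are the ones from the article; the HTML content itself is made up for illustration:

```python
import bs4

html = """
<div class="job_seen_beacon">
  <h2><a href="/viewjob?jk=123">Python Engineer</a></h2>
  <span class="companyName">Acme Corp</span>
  <span class="date">Posted 3 days ago</span>
</div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")
card = soup.find("div", class_="job_seen_beacon")
print(card.h2.text.strip())                           # job title
print(card.find("span", class_="companyName").text)   # company name
print(card.find("span", class_="date").text)          # posting date
print(card.h2.a["href"])                              # relative link to the job page
```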

def find_jobs(what, where, posted):
    url = _generate_url_job(what, where, posted)
    page = _get_page_job(url)
    jobs = page.find_all('div', class_='job_seen_beacon')

    job_dic = dict()

    for index, job in enumerate(jobs):
        job_title = job.find('h2').text.strip()
        company_name = job.find('span', class_='companyName').text.strip()
        posted_date = job.find('span', class_='date').text.split('Posted')[-1]
        more_info = 'https://indeed.com' + job.h2.a['href']

        # print(job_title, company_name, posted_date, more_info)

        job_dic[index] = job_title, company_name, posted_date, more_info
    return job_dic

I left you a commented-out print statement so you can check each job individually: the job title, company name, posting date, and a link with more information about the job.

Now it is enough to call the function and print the result:

print(find_jobs('python', 'remote', 3))

And we get results, for example a Python Engineer position.
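The returned dictionary maps an index to a (title, company, date, link) tuple, so it can be printed more readably with a small helper. Both the helper and the sample data below are my own additions, not part of the scraper:

```python
def format_jobs(job_dic: dict) -> str:
    # One readable line per job, followed by the link on the next line
    lines = [f"{i}: {title} at {company} ({date})\n   {link}"
             for i, (title, company, date, link) in job_dic.items()]
    return "\n".join(lines)

sample = {0: ("Python Engineer", "Acme Corp", "3 days ago",
              "https://indeed.com/viewjob?jk=example")}
print(format_jobs(sample))
```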

As you can see, it is a simple process, but it requires inspecting the HTML and finding where the real information lives. As always, I wish you lots of luck and happy coding as you explore web scraping.
