web scraping the Indeed job portal in Python using Beautiful Soup and Scrapy

To scrape the Indeed job portal using Python, you can use either Beautiful Soup or Scrapy. Beautiful Soup is a good fit for smaller projects, while Scrapy is better suited to larger and more complex crawls.
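Both libraries are available from PyPI; note that the package for Beautiful Soup is named beautifulsoup4:

```shell
pip install requests beautifulsoup4 scrapy
```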

Here is an example of how to scrape job postings from Indeed using Beautiful Soup:

main.py
import requests
from bs4 import BeautifulSoup

url = 'https://www.indeed.com/jobs?q=python+developer&l=New+York'

# A browser-like User-Agent makes it less likely the request is rejected.
headers = {'User-Agent': 'Mozilla/5.0'}

response = requests.get(url, headers=headers)
response.raise_for_status()

soup = BeautifulSoup(response.content, 'html.parser')

jobs = []

# Note: Indeed's markup changes over time, so these class names and
# attribute values may need updating to match the current page structure.
for div in soup.find_all(name='div', attrs={'class': 'row'}):
    for a in div.find_all(name='a', attrs={'data-tn-element': 'jobTitle'}):
        jobs.append(a['title'])

print(jobs)

Explanation:

  • We use the requests module to fetch the HTML content from the URL.
  • We create a Beautiful Soup object by passing in the HTML content and using the 'html.parser' parser.
  • We define an empty list called jobs to store the extracted job postings.
  • We use the find_all method on the Beautiful Soup object to find all the div tags with a class value of 'row'.
  • For each div tag, we use the find_all method to find all the a tags with a data-tn-element attribute value of 'jobTitle'.
  • We loop over the a tags and append the value of their title attribute to the jobs list.
  • Finally, we print the jobs list to the console.
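Once the titles are collected, the list can be written out for later use. Here is a minimal sketch using the standard csv module, with a hypothetical jobs list standing in for live results:

```python
import csv

# Hypothetical titles standing in for the scraper's output.
jobs = ['Python Developer', 'Senior Python Engineer', 'Backend Developer']

# Write one title per row, with a header, to jobs.csv.
with open('jobs.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['job_title'])
    for title in jobs:
        writer.writerow([title])
```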

Here is an example of how to scrape job postings from Indeed using Scrapy:

main.py
import scrapy

class IndeedSpider(scrapy.Spider):
    name = 'indeed'
    start_urls = ['https://www.indeed.com/jobs?q=python+developer&l=New+York']

    def parse(self, response):
        # Note: Indeed's markup changes over time, so this class string
        # may need updating to match the current page structure.
        for job in response.xpath('//div[@class="jobsearch-SerpJobCard unifiedRow row result"]'):
            yield {
                'job_title': job.xpath('.//h2/a/@title').get(),
            }

Explanation:

  • We create a new Scrapy spider by defining a new subclass of scrapy.Spider.
  • We give the spider a name and define the starting URL.
  • We define a parse method that will be called to handle the HTTP response from each URL visited by the spider.
  • We use an XPath selector to target the div elements containing job postings.
  • For each job posting, we define a dictionary containing one key-value pair, where the key is 'job_title' and the value is the job title extracted using another XPath selector.
  • We yield the job dictionary to Scrapy, which will handle the output for us.
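The spider above can be run directly with Scrapy's runspider command, which executes a spider from a single file without requiring a full Scrapy project (jobs.json is an arbitrary output filename):

```shell
scrapy runspider main.py -o jobs.json
```

Scrapy infers the output format from the file extension, so -o jobs.csv would produce CSV output instead.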
