Web Scraping using BeautifulSoup
Srishty Suman * 30-June-2019
Web Scraping, BeautifulSoup, Pandas.
Every day, we need to extract data from the web to build descriptive models. For this, we usually rely on SQL and NoSQL databases, APIs, or ready-made CSV datasets. The problem arises when the dataset we want is not available, the databases are not maintained, or the APIs are expensive or have usage limits.
The solution to all of these problems is web scraping.
What is Web Scraping?
Web scraping is a term for various methods used to collect information from across the Internet. Generally, this is done with software that simulates human Web surfing to collect specified bits of information from different websites. Those who use web scraping programs may be looking to collect certain data to sell to other users or to use for promotional purposes on a website.
Web scraping is also called Web data extraction, screen scraping or Web harvesting.
Why is Web Scraping used?
There are several reasons why web scraping can be the most practical way to get the data you need.
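- Websites are more important than APIs: a company usually keeps its website working and up to date, while a public API may lag behind or not exist at all.
- No rate-limiting: websites rarely publish API-style quotas, although a site may still block clients that send requests too quickly.
- Anonymous access: public pages can usually be read without registering for an API key.
- The data is already in your face: what the page shows in the browser is the data you can extract.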
Applications that require Web Scraping
Web scraping is used in a variety of applications:
- Competitive Pricing
- Market Sentiment Analysis
- Customer Sentiment Analysis
- Email address gathering
- Social Media Scraping
- Research and Development
- Job listings
Supporting Library for Web Scraping
With Python as the scripting language, several libraries are available. The following basic libraries are enough to scrape most static web content:
- Requests (to get the HTML content)
- BeautifulSoup (to parse the HTML)
- Pandas (to build a DataFrame and write it to a CSV file)
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favourite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
How to extract data through Web Scraping
To extract data with BeautifulSoup, follow these steps:
Step 1: Import all the necessary libraries used for getting data:
- import requests
- from bs4 import BeautifulSoup
- import pandas as pd
Step 2: Store the URL you want to scrape in a variable (baseUrl in the steps below)
Step 3: Inspect the page
- The data is usually nested in tags, so we inspect the page to see which tag the data we want to scrape is nested under. To inspect the page, right-click the element and choose “Inspect”.
Step 4: Find the data you want to extract
- Decide which elements hold the desired data and note their tag names and attributes, so that they can be located once the page has been parsed.
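As a rough illustration of Steps 3 and 4, the sketch below searches a small, made-up HTML fragment instead of a live page. The markup and the class names (product, name, price) are assumptions chosen for the example; a real site uses its own markup, which you discover by inspecting the page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for what "Inspect" might reveal on a real page.
html = """
<div class="product">
  <a class="name" href="/item/1">Phone A</a>
  <span class="price">9,999</span>
</div>
<div class="product">
  <a class="name" href="/item/2">Phone B</a>
  <span class="price">14,499</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag; class_ filters on the CSS class.
products = soup.find_all("div", class_="product")

# Within each product block, pull out the text and the link target.
names = [p.find("a", class_="name").get_text(strip=True) for p in products]
prices = [p.find("span", class_="price").get_text(strip=True) for p in products]
links = [p.find("a", class_="name")["href"] for p in products]

print(names)
print(prices)
print(links)
```

Searching inside each product block, rather than collecting all names and all prices separately, keeps each name paired with its own price even if one block is missing a field.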
Step 5: Get the HTML content of the page. This is done using the requests library.
- r = requests.get(baseUrl)
- c = r.content
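A slightly more defensive version of this step might look like the sketch below. The function name fetch_html, the User-Agent string and the 10-second default timeout are assumptions made for illustration and are not part of the original walkthrough; the requests calls themselves (Session.get, raise_for_status, content) are standard.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the raw HTML bytes of `url`, raising on HTTP errors."""
    # Reusing a session keeps connections open across several requests.
    session = session or requests.Session()
    # Hypothetical User-Agent; some sites reject the library default.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)"}
    # Without a timeout, a stalled server could block the script indefinitely.
    response = session.get(url, headers=headers, timeout=timeout)
    # Turn 4xx/5xx responses into an exception instead of parsing an error page.
    response.raise_for_status()
    return response.content
```

With this helper, c = fetch_html(baseUrl) gives the same kind of bytes as r.content, but fails loudly when the server returns an error status.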
Step 6: Parse the HTML. This is done with BeautifulSoup.
- soup = BeautifulSoup(c, "html.parser")
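To show what the parsed soup object offers, here is a minimal sketch that parses a hard-coded byte string in place of r.content. The markup is invented for the example.

```python
from bs4 import BeautifulSoup

# Stand-in for r.content; requests returns bytes, which BeautifulSoup accepts.
c = (
    b"<html><head><title>Demo page</title></head>"
    b"<body><h1 id='top'>Hello</h1><p>First</p><p>Second</p></body></html>"
)

soup = BeautifulSoup(c, "html.parser")

title = soup.title.string          # text inside <title>
heading = soup.find("h1")          # first <h1> tag in the document
heading_id = heading["id"]         # attributes are read like dictionary keys
paragraphs = [p.get_text() for p in soup.find_all("p")]  # text of every <p>

print(title, heading_id, paragraphs)
```

"html.parser" is the parser bundled with Python, so it needs no extra installation; other parsers such as lxml can be passed as the second argument if they are installed.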
Step 7: Store the data in the required format
- After extracting the data, you will usually want to save it. The format depends on your requirements. For this example, we will store the extracted data in CSV (comma-separated values) format.
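The storage step can be sketched with Pandas as follows. The rows, the column names and the file name products.csv are hypothetical; they stand in for whatever was extracted from the page.

```python
import pandas as pd

# Hypothetical rows, as they might look after extraction with BeautifulSoup.
rows = [
    {"name": "Phone A", "price": "9,999"},
    {"name": "Phone B", "price": "14,499"},
]

df = pd.DataFrame(rows)

# Scraped prices arrive as text; drop the thousands separator and convert to integers.
df["price"] = df["price"].str.replace(",", "", regex=False).astype(int)

# index=False leaves the DataFrame's row numbers out of the file.
df.to_csv("products.csv", index=False)

# Read the file back to confirm what was written.
print(pd.read_csv("products.csv"))
```

Cleaning the values before writing them means the CSV can be loaded later for analysis without repeating the conversion.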
Let's write code that uses BeautifulSoup to scrape the Flipkart website.
In this blog, we discussed the uses of web scraping and learnt how to scrape with BeautifulSoup. The scraped data can then be used to perform analytics and derive meaningful insights.
About Srishty Suman
Srishty Suman has 1+ years of experience in Data Engineering. She has mostly worked on organising data models from unorganised data drawn from multiple sources. She has also prepared statistical information reports and represented them graphically for better visualisation. In her free time, she likes writing blogs, traveling and exploring new places.