Beautiful Soup is a Python library designed for quick-turnaround projects like screen scraping. It sits atop an HTML or XML parser and provides Pythonic idioms for navigating, searching, and modifying the parse tree, which makes HTML and XML files easier to work with.
Installation
Before using Beautiful Soup, you need to install it along with a parser. Beautiful Soup supports several parsers, including html.parser (built into Python), lxml, and html5lib. lxml is generally the fastest of the three and is a common choice for practical use.
```bash
pip install beautifulsoup4
pip install lxml  # Recommended parser
```
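Because the parser is a separate install, a script may end up running somewhere lxml is missing. The following is a minimal sketch of one way to handle that, assuming you are happy to fall back to the built-in html.parser; the helper name `make_soup` is hypothetical and not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    Hypothetical helper for illustration only.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Beautiful Soup raises FeatureNotFound when the requested
        # parser is not installed or not recognised.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>world</b></p>")
print(soup.b.string)  # world
```

Different parsers can build slightly different trees from invalid markup, so a fallback like this is best suited to well-formed input.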
Basic Usage
Let’s start with a basic example that fetches and parses HTML from a webpage:
```python
from bs4 import BeautifulSoup
import requests

url = "http://example.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')

print(soup.prettify())
```
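In a longer-running script it can help to guard the request itself. The sketch below wraps the fetch-and-parse steps in a hypothetical helper, `fetch_soup`; the 10-second timeout and the choice of html.parser as the default are assumptions made for illustration, not recommendations from the library.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, parser="html.parser", timeout=10):
    """Download a page and return it as a BeautifulSoup object.

    Hypothetical helper; the default timeout and parser are assumptions.
    """
    # Without a timeout, requests may wait indefinitely for a response.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # silently parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.text, parser)
```

Calling `fetch_soup("http://example.com/")` would then return the same kind of object as the `soup` variable above.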
Searching the Tree
Beautiful Soup defines numerous methods for searching the parse tree, including searching by tag name and searching using filters based on a tag’s attributes.
Finding All Instances of a Tag
```python
for link in soup.find_all('a'):
    print(link.get('href'))
```
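Filtering on attributes, as mentioned above, might look like the following sketch. The HTML snippet is made up for illustration, and html.parser is used only so the example runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
html = """
<a href="/home" class="nav">Home</a>
<a class="nav">No link</a>
<a href="https://example.com/docs" id="docs">Docs</a>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True matches only <a> tags that actually have an href attribute.
with_href = [a["href"] for a in soup.find_all("a", href=True)]

# class_ (with a trailing underscore, since class is a Python keyword)
# filters on the CSS class.
nav_text = [a.get_text() for a in soup.find_all("a", class_="nav")]

# An attrs dictionary can express the same kind of filter.
docs = soup.find_all("a", attrs={"id": "docs"})

print(with_href)  # ['/home', 'https://example.com/docs']
print(nav_text)   # ['Home', 'No link']
print(len(docs))  # 1
```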
Finding the First Instance of a Tag
```python
print(soup.find('a').get('href'))
```
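One detail worth keeping in mind: `find()` returns `None` when nothing matches, so chaining `.get()` directly onto it raises an `AttributeError` on pages without the tag. A minimal sketch of a guarded lookup, using made-up markup that contains no links:

```python
from bs4 import BeautifulSoup

# Made-up markup with no <a> tag at all.
soup = BeautifulSoup("<p>No links here</p>", "html.parser")

first_link = soup.find("a")
if first_link is None:
    # Nothing matched, so there is no href to read.
    href = None
else:
    href = first_link.get("href")

print(href)  # None
```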
Navigating the Tree
You can navigate the structure of the page by using tag names as attributes.
```python
head_title = soup.head.title
print(head_title)         # <title>Example Domain</title>
print(head_title.string)  # Example Domain
```
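Attribute-style access only reaches the first matching tag, so it is often combined with properties such as `.parent` and `.children` or with `find_next_sibling()`. The sketch below uses a small made-up document; the markup is written without whitespace between tags because whitespace would otherwise show up as extra text nodes among the children.

```python
from bs4 import BeautifulSoup

# Made-up document, kept free of whitespace between tags.
html = (
    "<html><head><title>Example Domain</title></head>"
    "<body><h1>Heading</h1><p>First</p><p>Second</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Going up: the <title> tag's parent is <head>.
parent_name = soup.title.parent.name

# Going down: iterate over the direct children of <body>.
child_names = [child.name for child in soup.body.children]

# Going sideways: the next <p> after the first one.
first_p = soup.body.p
next_text = first_p.find_next_sibling("p").get_text()

print(parent_name)  # head
print(child_names)  # ['h1', 'p', 'p']
print(next_text)    # Second
```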
Parsing a Local HTML File
You can also use Beautiful Soup to parse local HTML files. This is particularly useful for testing and development.
```python
with open("index.html") as file:
    soup = BeautifulSoup(file, "lxml")

print(soup.prettify())
```
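For a test that does not depend on an existing index.html, one option is to write a temporary file first. This is a sketch under the assumption that the file is UTF-8 encoded; passing `encoding` explicitly avoids relying on the platform default. The file contents are invented for the example.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a throwaway HTML file (contents are made up).
    path = Path(tmp_dir) / "index.html"
    path.write_text(
        "<html><head><title>Café menu</title></head></html>",
        encoding="utf-8",
    )

    # Beautiful Soup reads the whole file while building the tree,
    # so the soup object stays usable after the file is closed.
    with open(path, encoding="utf-8") as file:
        soup = BeautifulSoup(file, "html.parser")

page_title = soup.title.string
print(page_title)  # Café menu
```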
Working with CSS Selectors
Beautiful Soup also allows you to find elements by CSS selectors using the .select() method.
```python
for tag in soup.select("div.myClass > a"):
    print(tag.get('href'))
```
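To make the difference between selector forms concrete, here is a small sketch on made-up markup. It contrasts the child combinator (`>`) with a plain descendant selector and shows `select_one()`, which returns the first match or `None`.

```python
from bs4 import BeautifulSoup

# Made-up markup: one direct link and one nested link inside div.myClass.
html = """
<div class="myClass"><a href="/one">One</a><span><a href="/nested">Nested</a></span></div>
<div class="other"><a href="/two">Two</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# "div.myClass > a" matches only links that are direct children.
direct = [a["href"] for a in soup.select("div.myClass > a")]

# "div.myClass a" matches links at any depth inside the div.
descendants = [a["href"] for a in soup.select("div.myClass a")]

# select_one() returns the first match, or None if nothing matches.
first = soup.select_one("a[href='/two']")
missing = soup.select_one("div.missing")

print(direct)            # ['/one']
print(descendants)       # ['/one', '/nested']
print(first.get_text())  # Two
print(missing)           # None
```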
Modifying the Tree
Beautiful Soup allows you to edit the HTML or XML tree, such as adding, modifying, and deleting tags.
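```python
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
tag = soup.b
tag.name = "blockquote"    # rename <b> to <blockquote>
tag['class'] = 'verybold'  # replace the class attribute
tag['id'] = 1              # add a new attribute
del tag['class']           # remove the class attribute again
print(soup.prettify())
```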
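Adding and deleting whole tags could be sketched as follows. The markup and the link target are made up; `new_tag()`, `append()`, and `decompose()` are the Beautiful Soup calls being illustrated.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
soup = BeautifulSoup(
    "<div><p>Keep me</p><span>Remove me</span></div>", "html.parser"
)

# Add: create a new <a> tag and append it to the <div>.
link = soup.new_tag("a", href="https://example.com/")
link.string = "Example"
soup.div.append(link)

# Delete: remove the <span> and its contents from the tree.
soup.span.decompose()

result = str(soup)
print(result)
```

After these calls the `<span>` is gone and the new link is the last child of the `<div>`.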
Example: Extracting Article Titles from a Blog
Here’s a more practical example where Beautiful Soup is used to extract article titles from a blog page:
```python
from bs4 import BeautifulSoup
import requests

url = "https://blog.example.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')

titles = soup.find_all('h2', class_='post-title')
for title in titles:
    print(title.text.strip())
```
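The extraction step can also be tried without a network connection by separating it from the download. In this sketch, `extract_titles` is a hypothetical helper and the sample markup is invented; it assumes, as in the example above, that post titles are `h2` elements carrying the `post-title` class.

```python
from bs4 import BeautifulSoup


def extract_titles(html, parser="html.parser"):
    """Return the text of every <h2 class="post-title"> in the markup.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, parser)
    return [
        h2.get_text(strip=True)
        for h2 in soup.find_all("h2", class_="post-title")
    ]


# Invented sample page standing in for the downloaded blog HTML.
sample = """
<article><h2 class="post-title"> First post </h2></article>
<article><h2 class="post-title featured"><a href="/second">Second post</a></h2></article>
<article><h2 class="sidebar-title">Not a post</h2></article>
"""

titles = extract_titles(sample)
print(titles)  # ['First post', 'Second post']
```

Note that `class_='post-title'` also matches elements that carry additional classes, such as the second heading here.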
Conclusion
Beautiful Soup is a powerful and flexible library that makes web scraping in Python simple and accessible. By abstracting away the complexities of parsing and querying HTML or XML documents, whether they come from the web or from local files, it lets you focus on the data you need to extract and manipulate.