Anyone who has done any web scraping in Python will know the importance of the BeautifulSoup (bs4) library. Parsing HTML pages is a routine part of web scraping, and BeautifulSoup makes it much easier: it parses the markup into a tree of tags that you can search and navigate, so you can scrape the data you need with very little code. If you need to pull clean, structured data out of web pages, bs4 is a good place to start.
In this article, we will learn how to get the attributes of an element in a BeautifulSoup tree.
Extract attribute from an element
BeautifulSoup allows you to extract a single attribute from an element given its name just like how you would access a Python dictionary.
element['attribute name']
For example, the following code snippet prints the first author link from the Quotes to Scrape page.
import requests
from bs4 import BeautifulSoup
url = "https://quotes.toscrape.com/"
soup = BeautifulSoup(requests.get(url).text, "lxml")
first_link = soup.find('div', class_='quote').find('a')
href = first_link['href']
print(href)
# Prints:
# /author/Albert-Einstein
Safely get an attribute from an element
Sometimes an attribute is not present on every element. In that case, accessing it with square brackets raises a KeyError.
In [6]: first_link['attr']
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-6-78ace83cff14> in <module>
----> 1 first_link['attr']
~/.local/share/virtualenvs/testing-p6fO7ldL/lib/python3.8/site-packages/bs4/element.py in __getitem__(self, key)
1484 """tag[key] returns the value of the 'key' attribute for the Tag,
1485 and throws an exception if it's not there."""
-> 1486 return self.attrs[key]
1487
1488 def __iter__(self):
KeyError: 'attr'
In this situation, use the get() method to read the attribute safely. It returns the attribute value if the attribute is present, and None otherwise.
Example:
In [7]: first_link.get('attr')
In [8]: first_link.get('href')
Out[8]: '/author/Albert-Einstein'
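As a minimal offline sketch of the same idea, the snippet below runs get() over two links, one of which has no href. The markup is made up for illustration and is not taken from the Quotes to Scrape page; the built-in html.parser is used so that no extra parser is needed. It also shows that get() accepts an optional fallback value as its second argument, much like dict.get().

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
html = '<a href="/author/Albert-Einstein">(about)</a><a>no link here</a>'
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")

# get() returns None when the attribute is missing...
hrefs = [link.get("href") for link in links]

# ...or the fallback passed as the second argument.
hrefs_with_default = [link.get("href", "#") for link in links]

print(hrefs)
print(hrefs_with_default)
```

The fallback form is handy when the values go straight into string operations that would fail on None.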
Get all attributes of an element
To get all attributes of an element, read its attrs property, which is a regular Python dictionary that maps attribute names to values, as demonstrated below.
In [8]: first_link.attrs
Out[8]: {'href': '/author/Albert-Einstein'}
Turning the attributes into lists is easy, too: use keys() for the attribute names and values() for their values. Both return dictionary views, so wrap the result in list() if you absolutely need a Python list. Calling list() on attrs itself gives you only the names.
In [14]: author1 = soup.find_all(attrs={'itemprop': 'author'})[0]
In [15]: list(author1.attrs)
Out[15]: ['class', 'itemprop']
In [16]: author1.attrs.values()
Out[16]: dict_values([['author'], 'author'])
In [17]: author1.attrs.keys()
Out[17]: dict_keys(['class', 'itemprop'])
In [18]: list(author1.attrs.values())
Out[18]: [['author'], 'author']
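The output above mixes a list (['author']) with a plain string ('author'). That is because BeautifulSoup treats class as a multi-valued attribute and always returns it as a list, while most other attributes come back as strings. The sketch below shows this on a made-up tag (the markup and the id value are assumptions for illustration only).

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
html = '<small class="author name" id="first">Albert Einstein</small>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("small")

# attrs is a plain dict: class is a list, id is a string.
attrs = tag.attrs

# Join the class list back into the string found in the markup.
class_string = " ".join(tag["class"])

# get_attribute_list() always returns a list, whatever the attribute.
id_as_list = tag.get_attribute_list("id")

for name, value in attrs.items():
    print(name, value)
```

If you want every value in the same shape, get_attribute_list() saves you from checking the type of each one.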
If you want to find all HTML/XML tags that share the same attribute value, pass a dictionary to the attrs argument of find() or find_all().
In [10]: soup.find_all(attrs={'itemprop': 'author'})
Out[10]:
[<small class="author" itemprop="author">Albert Einstein</small>,
<small class="author" itemprop="author">J.K. Rowling</small>,
<small class="author" itemprop="author">Albert Einstein</small>,
<small class="author" itemprop="author">Jane Austen</small>,
<small class="author" itemprop="author">Marilyn Monroe</small>,
<small class="author" itemprop="author">Albert Einstein</small>,
<small class="author" itemprop="author">André Gide</small>,
<small class="author" itemprop="author">Thomas A. Edison</small>,
<small class="author" itemprop="author">Eleanor Roosevelt</small>,
<small class="author" itemprop="author">Steve Martin</small>]
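Once the tags are selected, pulling data out of them is a short list comprehension. The following is a small offline sketch on made-up markup that loosely imitates the structure of the quotes page (an assumption for illustration, not the real page source). It also uses True as the attribute value, which matches any tag that has the attribute at all.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that loosely imitates the quotes page.
html = """
<div class="quote">
  <small class="author" itemprop="author">Albert Einstein</small>
  <a href="/author/Albert-Einstein">(about)</a>
</div>
<div class="quote">
  <small class="author" itemprop="author">Jane Austen</small>
  <a href="/author/Jane-Austen">(about)</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Select by exact attribute value, then read the text of each match.
authors = soup.find_all(attrs={"itemprop": "author"})
names = [tag.get_text() for tag in authors]

# True matches every <a> tag that has an href, whatever its value.
links = soup.find_all("a", attrs={"href": True})
hrefs = [link["href"] for link in links]

print(names)
print(hrefs)
```

Because the second search only returns tags that have an href, indexing with square brackets is safe there.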
Extract attributes that match a condition
A common scenario is getting the attributes of tags that match a certain condition. This all boils down to selecting/filtering the right elements.
You can use the find() and find_all() methods with the class_, id, and attrs arguments to do that. But did you know that you can even use regex?
The example below uses a regex to find all elements whose id attribute contains a digit.
import re
# Find all elements whose id contains a digit
soup.find_all(id=re.compile(r"\d"))
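To see the regex filter in action without a network request, here is a minimal sketch on made-up markup (the id values are assumptions chosen for illustration). BeautifulSoup tests the pattern with a search, so the digit can appear anywhere in the id.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
html = (
    '<p id="quote-1">first</p>'
    '<p id="intro">no digit</p>'
    '<p id="quote-22">second</p>'
    "<p>no id at all</p>"
)
soup = BeautifulSoup(html, "html.parser")

# Keep only the tags whose id contains at least one digit.
numbered = soup.find_all(id=re.compile(r"\d"))
ids = [tag["id"] for tag in numbered]

print(ids)
```

Tags without an id are skipped, so reading tag["id"] on the results cannot raise a KeyError.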
The find() and find_all() methods also support searching by a function, which we can use to our advantage. The function receives each tag and returns a truthy value for the ones to keep.
# Find all elements whose itemprop attribute contains 'author'
In [24]: soup.find_all(lambda tag: tag.get('itemprop') and 'author' in tag.get('itemprop'))
Out[24]:
[<small class="author" itemprop="author">Albert Einstein</small>,
<small class="author" itemprop="author">J.K. Rowling</small>,
<small class="author" itemprop="author">Albert Einstein</small>,
<small class="author" itemprop="author">Jane Austen</small>,
<small class="author" itemprop="author">Marilyn Monroe</small>,
<small class="author" itemprop="author">Albert Einstein</small>,
<small class="author" itemprop="author">André Gide</small>,
<small class="author" itemprop="author">Thomas A. Edison</small>,
<small class="author" itemprop="author">Eleanor Roosevelt</small>,
<small class="author" itemprop="author">Steve Martin</small>]
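A named function is often easier to read than a lambda once the condition grows. The sketch below is one possible way to write such a filter; the function name links_to_author and the markup are hypothetical and exist only for this illustration. It keeps the a tags whose href starts with /author/ and uses get() with an empty-string fallback so that tags without an href do not raise an error.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
html = (
    '<a href="/author/Albert-Einstein">(about)</a>'
    '<a href="/tag/life/">life</a>'
    "<a>no href</a>"
    '<small itemprop="author">Albert Einstein</small>'
)
soup = BeautifulSoup(html, "html.parser")


def links_to_author(tag):
    """Return True for <a> tags whose href starts with /author/."""
    return tag.name == "a" and tag.get("href", "").startswith("/author/")


matches = soup.find_all(links_to_author)
hrefs = [tag["href"] for tag in matches]

print(hrefs)
```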