How to remove tags with BeautifulSoup

When it comes to web scraping with Python, few libraries match BeautifulSoup in features and ease of use. A few lines of code can save you hours, or even days, of work.

BeautifulSoup parses an HTML document into a tree of tags and text strings. In this article, we will show you how to remove an HTML tag from that tree.

Remove tags with extract()

BeautifulSoup has a built-in method called extract() that removes a tag or string from the tree. Once you've located the element you want to get rid of, say i_tag, calling i_tag.extract() removes the element and returns it at the same time.

from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a

# extract() removes the <i> tag from the tree and returns it
i_tag = soup.i.extract()

print(a_tag)         # <a href="http://example.com/">I linked to </a>
print(i_tag)         # <i>example.com</i>
print(i_tag.parent)  # None

Note that the returned element is a bs4.Tag or bs4.NavigableString, not a plain Python string; call str() on it if you need one. extract() also modifies the original BeautifulSoup object in place, so searching the soup for the <i> tag afterwards finds nothing.
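To make the points above concrete, here is a minimal sketch that reuses the sample markup from this section. The variable names (is_tag, found_after, as_html, as_text) are ours and chosen only for illustration.

```python
from bs4 import BeautifulSoup, Tag

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')

i_tag = soup.i.extract()

# The return value is a Tag object that is now detached from the tree.
is_tag = isinstance(i_tag, Tag)

# Searching the modified tree no longer finds an <i> element.
found_after = soup.find('i')

# Convert explicitly when a plain Python string is needed.
as_html = str(i_tag)        # markup of the extracted tag
as_text = i_tag.get_text()  # only the text inside it

print(is_tag)       # True
print(found_after)  # None
print(as_html)      # <i>example.com</i>
print(as_text)      # example.com
```

Because extract() hands the element back, it is a reasonable choice when you still need the removed content, for example to log it or to insert it elsewhere.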

Remove tags with decompose()

If you don't care about the content of the tag and just want to destroy it completely, use BeautifulSoup's decompose() method. Calling i_tag.decompose() removes i_tag and its contents from the BeautifulSoup tree and returns nothing.

from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a
i_tag = soup.i

# decompose() destroys the <i> tag and its contents; it returns None
i_tag.decompose()

print(a_tag)  # <a href="http://example.com/">I linked to </a>

If you want to be sure that a tag has been decomposed, you can check its .decomposed property.

# continuing from the previous example
print(i_tag.decomposed)  # True
print(a_tag.decomposed)  # False
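A common use of decompose() is stripping every tag of a given kind, such as script and style blocks, before reading a page's text. The sketch below assumes a small made-up page (the html string is hypothetical) and is one way to do it, not the only one.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, made up for illustration.
html = (
    '<html><head><style>p {color: red}</style></head>'
    '<body><p>Hello</p><script>alert("hi")</script><p>World</p></body></html>'
)
soup = BeautifulSoup(html, 'html.parser')

# find_all() returns a list of matches collected up front,
# so removing tags inside the loop is safe.
for tag in soup.find_all(['script', 'style']):
    tag.decompose()

cleaned = str(soup)
print(cleaned)
# <html><head></head><body><p>Hello</p><p>World</p></body></html>
```

Since decompose() returns nothing, the removed tags cannot be recovered afterwards; use extract() in the loop instead if you need to keep them.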

Remove tags with unwrap()

BeautifulSoup's unwrap() method replaces a tag with the contents inside that tag and returns the tag that was replaced. If you want to remove a parent HTML tag from the BeautifulSoup tree while keeping its children and descendants, this is the method you're looking for.

from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a

# unwrap() removes the <i> tag itself but keeps its text in place
a_tag.i.unwrap()

print(a_tag)  # <a href="http://example.com/">I linked to example.com</a>
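The same idea extends to several tags at once. As a rough sketch, assuming a made-up paragraph with inline formatting, the loop below unwraps every <b> and <i> tag so that only the text remains inside the paragraph.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, made up for illustration.
markup = '<p>Some <b>bold</b> and <i>italic</i> text</p>'
soup = BeautifulSoup(markup, 'html.parser')

# Unwrap every <b> and <i> tag: the tags go away, their text stays.
for tag in soup.find_all(['b', 'i']):
    tag.unwrap()

result = str(soup)
print(result)  # <p>Some bold and italic text</p>
```

In short: extract() removes a tag and gives it back, decompose() removes a tag and discards it, and unwrap() removes only the tag while leaving its contents where they were.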

Conclusion

We hope the information above helped you find the right method for removing a tag from HTML in your case. If you're interested in more basic BeautifulSoup tutorials, check out our guides on how to find an element by class, how to get text from web pages, and how to get attributes of elements in BeautifulSoup.

If you have any questions, then please feel free to ask in the comments below.
