BeautifulSoup: get text from web pages

BeautifulSoup is a popular Python package designed for web scraping. It uses a handful of parsers, such as lxml and html.parser, to handle HTML and XML documents. Besides extracting data, BeautifulSoup also lets you navigate the HTML document tree and modify it programmatically. BeautifulSoup is usually used together with the requests package: requests fetches a page, and BeautifulSoup extracts the data from the response.

In this article, you will learn how to get the text out of HTML and XML documents with BeautifulSoup's get_text() method, leaving the tags behind. The code in the article is written in Python 3, and web pages are fetched with the requests library. The Quotes to Scrape website is used as the source we scrape information from.

BeautifulSoup get text

BeautifulSoup has a built-in method for pulling the text out of an element: get_text(). To use it, simply call the method on any Tag or BeautifulSoup object. There is normally no reason to call it on a NavigableString, because that object already represents a string.

from bs4 import BeautifulSoup
import requests

# Fetch the page and create a Beautiful Soup object
page = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(page.text, "lxml")

# Get the raw text of first quote
quote_elem = soup.find("div", class_="quote")
quote = quote_elem.find("span", class_="text")
quote_text = quote.get_text()
print(quote_text)
Code language: Python (python)

Running the code snippet above prints the text of the first quote:

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
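If you want to try get_text() without fetching a page, here is a minimal offline sketch. The markup is made up for illustration (it only loosely resembles a quote block), and it uses Python's built-in html.parser so that lxml is not required. It shows the method called on a Tag, on the whole BeautifulSoup object, and with the strip=True argument.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only, not copied from any website
html = (
    '<div class="quote">'
    '<span class="text"> A quote. </span>'
    '<small class="author">Someone</small>'
    "</div>"
)

soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", class_="text")

# On a Tag: the text inside that element, whitespace included
span_text = span.get_text()

# On the BeautifulSoup object: the text of the whole document
whole_text = soup.get_text()

# strip=True trims each piece of text before the pieces are joined
stripped_text = span.get_text(strip=True)

print(repr(span_text))
print(repr(whole_text))
print(repr(stripped_text))
```

Notice that the text of the whole document runs the quote and the author together, because nothing is inserted between the pieces by default.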

BeautifulSoup get text with <br> tags

By default, BeautifulSoup get_text() joins all the pieces of text inside an element without any separator, and a <br> tag adds nothing to the output. That matters when you want to get the text from an element whose lines are separated by <br> tags instead of proper <p> tags. Let's say we have an HTML element that looks like this:

<div class="example">
Lorem ipsum dolor sit amet? <br>
consectetur adipiscing elit. <br>
Vivamus nec <a class="someLink" href="example.com">arcu</a> erat. <br>
Suspendisse a mauris vestibulum, rhoncus. <br>
</div>
Code language: HTML, XML (xml)

You can pass the separator parameter to get_text() to choose the string that is placed between the pieces of text inside the div, like so:

from bs4 import BeautifulSoup

html = """<div class="example">
Lorem ipsum dolor sit amet? <br>
consectetur adipiscing elit. <br>
Vivamus nec <a class="someLink" href="example.com">arcu</a> erat. <br>
Suspendisse a mauris vestibulum, rhoncus. <br>
</div>"""

soup = BeautifulSoup(html, "lxml")
elem = soup.find("div", class_="example")

# Join the pieces of text inside the div with a single space
print(elem.get_text(separator=" "))
Code language: Python (python)

The output contains all the text of the div, including the text of the link, with the separator placed between the pieces.
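To get one line of output per piece of text, a newline can be used as the separator together with strip=True. The sketch below reuses the example div and assumes the built-in html.parser instead of lxml. Keep in mind that the separator goes between every piece of text, not only where a <br> sits, so the link text ends up on a line of its own.

```python
from bs4 import BeautifulSoup

html = """<div class="example">
Lorem ipsum dolor sit amet? <br>
consectetur adipiscing elit. <br>
Vivamus nec <a class="someLink" href="example.com">arcu</a> erat. <br>
Suspendisse a mauris vestibulum, rhoncus. <br>
</div>"""

soup = BeautifulSoup(html, "html.parser")
elem = soup.find("div", class_="example")

# strip=True trims every piece of text and drops the empty ones;
# separator="\n" then joins the remaining pieces with newlines
lines_text = elem.get_text(separator="\n", strip=True)
print(lines_text)
```

If the link text has to stay on the same line as its sentence, replacing the <br> tags themselves is the better fit.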

Alternatively, you can replace every <br> tag with a unique string of your choice before parsing, then, once you have the output, replace that string with newlines. Better yet, you can replace the <br> tags directly with the newline character \n.

from bs4 import BeautifulSoup

html = """<div class="example">
Lorem ipsum dolor sit amet? <br>
consectetur adipiscing elit. <br>
Vivamus nec <a class="someLink" href="example.com">arcu</a> erat. <br>
Suspendisse a mauris vestibulum, rhoncus. <br>
</div>"""

# Swap every <br> for a placeholder before parsing
html = html.replace("<br>", "myuniquetoken")

soup = BeautifulSoup(html, "lxml")
elem = soup.find("div", class_="example")

# Extract the text, then turn the placeholder into newlines
output = elem.get_text()
output = output.replace("myuniquetoken", "\n")
Code language: Python (python)

Please note that websites sometimes write the tag in more than one way, for example <br>, <br/> or the invalid </br>, and all of the variants should be accounted for.

Handling extra spaces and newlines in get_text() output

After using BeautifulSoup get_text(), you may need some post-processing to fine-tune the final result. The text usually comes with unnecessary newlines, tabs and spaces. If you want to trim those, Python's replace() string method is a good place to start. Look for \n, \r, double spaces and combinations of them.

output = elem.get_text()

# Remove carriage return/newline pairs and doubled newlines
output = output.replace("\r\n", "")
output = output.replace("\n\r", "")
output = output.replace("\n\n", "")
# ...
Code language: Python (python)
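Chaining replace() calls works, but it is easy to miss a combination. As an alternative sketch, the two helpers below (collapse_whitespace and tidy_lines are hypothetical names, not part of BeautifulSoup) use only the standard library: the first squeezes all whitespace into single spaces, the second keeps the line breaks but trims each line and drops the empty ones. The raw string stands in for the output of get_text().

```python
import re


def collapse_whitespace(text):
    """Collapse every run of spaces, tabs and newlines into one space."""
    # split() with no argument splits on any whitespace and ignores
    # leading and trailing whitespace
    return " ".join(text.split())


def tidy_lines(text):
    """Keep line breaks, but trim each line and drop the empty ones."""
    lines = (
        re.sub(r"[ \t]+", " ", line).strip()
        for line in text.splitlines()
    )
    return "\n".join(line for line in lines if line)


# Made-up sample standing in for the output of get_text()
raw = "  Lorem   ipsum \r\n\r\n\tdolor sit\n amet.  \n"

single_line = collapse_whitespace(raw)
clean_lines = tidy_lines(raw)

print(single_line)
print(clean_lines)
```

Which helper fits depends on whether the line structure of the original text matters for what you do next.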

We hope that the information above is useful to you.
