In this blog, we will learn about BeautifulSoup's find() and find_all() functions, which are used to parse scraped HTML content and extract useful data from the web.
You will use find() and find_all() most of the time when scraping with Python's BeautifulSoup.
Learn about Python Web Scraping Error Handling to understand this blog better.
Let's look at the find() function in the BeautifulSoup Python library.
find(name, attrs, recursive, string, **kwargs)
The first argument of the find() function is name, the tag name. It is the same as the HTML tag name, but it is passed as a string.
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Download the page and parse it
html = urlopen('http://pythonscraping.com/blog')
bs = BeautifulSoup(html, 'html.parser')

# Find the first <h1> tag and print its text
h1 = bs.find('h1')
print(h1.get_text())
Posts to Scrape
The second argument that the find() function takes is attrs, a dictionary of HTML attributes to match, such as class, id, value, or name.
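As a minimal sketch of the attrs argument, the snippet below matches an 'h1' tag by its id. The HTML string is a made-up example (not the real page), so the tag contents and the 'sidebar-title' id are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for illustration
html = """
<div>
  <h1 id="page-title" class="title">Posts to Scrape</h1>
  <h1 id="sidebar-title">Archive</h1>
</div>
"""

bs = BeautifulSoup(html, 'html.parser')

# Pass the attributes to match as a dictionary in the second argument
page_title = bs.find('h1', {'id': 'page-title'})
print(page_title.get_text())

# The same lookup written with a keyword argument
same_title = bs.find('h1', id='page-title')
print(same_title.get_text())
```

Both calls should select the same tag; the dictionary form is handy when the attribute name is not a valid Python identifier.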
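The third argument of the find() function, recursive, is a boolean value. It controls how deep the search goes in the BeautifulSoup object: with recursive=True (the default) every descendant is searched, while recursive=False checks only the direct children.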
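The effect of the recursive argument can be sketched on a small, made-up HTML fragment. The markup and the 'outer' id below are assumptions chosen only to show the difference between a deep and a shallow search.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for illustration
html = """
<div id="outer">
  <p>Direct child</p>
  <section>
    <span>Nested deeper</span>
  </section>
</div>
"""

bs = BeautifulSoup(html, 'html.parser')
outer = bs.find('div', {'id': 'outer'})

# Default behaviour (recursive=True): all descendants are searched
deep = outer.find('span')

# recursive=False: only the direct children of <div id="outer"> are checked
shallow = outer.find('span', recursive=False)
direct_p = outer.find('p', recursive=False)

print(deep)      # the nested <span> tag
print(shallow)   # None, because <span> is not a direct child
print(direct_p)  # the <p> tag, which is a direct child
```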
If the find() function is not able to find anything, it returns None.
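Because find() can return None, it is usually worth checking the result before calling get_text() on it. Here is a small sketch using a made-up HTML string; the fallback message is just a placeholder.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for illustration
html = "<html><body><h1>Posts to Scrape</h1></body></html>"
bs = BeautifulSoup(html, 'html.parser')

# There is no <h3> tag in this snippet, so find() returns None
h3 = bs.find('h3')

if h3 is None:
    heading_text = 'No <h3> tag found'
else:
    heading_text = h3.get_text()

print(heading_text)
```

Without the check, calling h3.get_text() here would raise an AttributeError, since None has no get_text() method.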
The find_all() function in BeautifulSoup finds all the matching tags and returns them as a list.
find_all(name, attrs, recursive, string, limit, **kwargs)
The function signature of find_all() is very similar to that of find(). The only difference is that it takes one more argument, limit. You can control the number of results returned using find_all()'s limit argument.
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Download the page and parse it
html = urlopen('http://pythonscraping.com/blog')
bs = BeautifulSoup(html, 'html.parser')

# Collect every <h2> tag on the page
h2 = bs.find_all('h2')
print(h2)
The output we get is a list.
We can get the text from the HTML tags by calling get_text() on each item in the list.
for h in h2:
    print(h.get_text())
Finally, we get all the 'h2' content, separated by newlines.
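The limit argument described earlier can be sketched in the same way. The HTML below is a made-up example with three 'h2' tags, so the post titles are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for illustration
html = """
<h2>First post</h2>
<h2>Second post</h2>
<h2>Third post</h2>
"""

bs = BeautifulSoup(html, 'html.parser')

# Stop searching after the first two matches
first_two = bs.find_all('h2', limit=2)

for h in first_two:
    print(h.get_text())
```

Only the first two headings should be printed, even though the snippet contains three.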
Scraping Title Using the Attribute
Suppose you want to scrape the title of the web page, but the problem is that there are many 'h1' tags. You can differentiate the 'h1' tags using the class attribute.
# Use find() here: find_all() returns a list, which has no get_text() method
h1 = bs.find('h1', class_="title")
print(h1.get_text())
As we know, class is a reserved keyword in Python and thus cannot be used as a variable or argument name. So the keyword argument is named class_ instead of class.
Posts to Scrape
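As a rough sketch, the class_ keyword and an attrs dictionary with a 'class' key can be used interchangeably. The HTML below is a made-up fragment, and the 'site-name' heading is an assumption added only so that there is more than one 'h1' tag to choose from.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for illustration
html = """
<h1 class="site-name">Python Scraping</h1>
<h1 class="title" id="page-title">Posts to Scrape</h1>
"""

bs = BeautifulSoup(html, 'html.parser')

# Keyword form: class_ avoids clashing with Python's reserved word "class"
by_keyword = bs.find('h1', class_='title')

# Dictionary form: the key is a plain string, so 'class' is fine here
by_dict = bs.find('h1', {'class': 'title'})

print(by_keyword.get_text())
print(by_dict.get_text())
```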
Multiple Tags in Find_all()
We can also pass a list of the tags that we want to scrape.
content = bs.find_all(['h1', 'p'])
print(content)
[<h1 class="title" id="page-title"> Posts to Scrape </h1>, <p>Well, the second edition has been out for a few months now, but the nice thing about</p>, <p>Four new chapters:</p>, <p>As muchurrently working on the following major changes:</p>, <p dir="ltr">Ah,rise software platforms.</p>, <p>How the mighty have fallen.</p>, <p>So, after the Selenium Python library.</p>, <p>It's a commonly Not really. </p>, <p>One......</p>]
The output above is truncated. Run the code on your machine to see the full output.
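To try the multiple-tag search without a network request, here is a self-contained sketch on a made-up HTML string. The 'h2' line is an assumption added to show that tags outside the list are skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for illustration
html = """
<h1 class="title" id="page-title">Posts to Scrape</h1>
<p>Four new chapters:</p>
<h2>A subheading</h2>
<p>How the mighty have fallen.</p>
"""

bs = BeautifulSoup(html, 'html.parser')

# Match any tag whose name appears in the list
content = bs.find_all(['h1', 'p'])

# Results come back in document order; <h2> is not in the list, so it is skipped
for tag in content:
    print(tag.name, '->', tag.get_text())
```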
Learn more about Python Web Scraping error handling using Beautiful Soup from the BS4 Documentation.