BeautifulSoup find() and find_all() Functions


In this blog, we will learn how BeautifulSoup's find() and find_all() functions are used to parse scraped HTML content and extract useful data from the web.

You will use find() and find_all() in almost every scraping project that relies on Python's BeautifulSoup.

Learn about Python Web Scraping Error Handling first to understand this blog better.

BeautifulSoup find()

Let's look at the find() function in the BeautifulSoup Python library.

find(name, attrs, recursive, string, **kwargs)

The first argument of find() is name: the HTML tag to search for, passed as a string.

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Download the page and parse it with the built-in HTML parser
html = urlopen('http://pythonscraping.com/blog')
bs = BeautifulSoup(html, 'html.parser')

# Find the first <h1> tag on the page and print its text
h1 = bs.find('h1')
print(h1.get_text())
Posts to Scrape

The second argument of find() is attrs: the HTML attributes to match against, such as class, id, name, or value.
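For example, here is a minimal sketch that passes attrs as a dictionary (the 'title' class comes from this page's markup, visible in the output further down):

h1 = bs.find('h1', {'class': 'title'})
print(h1.get_text())
Posts to Scrape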

The third argument of find() is recursive, a boolean that controls how deep the search goes. With recursive=True (the default), find() examines all descendants of the tag; with recursive=False, it looks only at the direct children.
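A quick sketch of the difference, reusing the bs object from above:

# recursive=False looks only at the direct children of <html>
# (<head> and <body>), so no <h1> is found:
print(bs.html.find('h1', recursive=False))
None

# recursive=True is the default and searches all descendants:
print(bs.html.find('h1').get_text())
Posts to Scrape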

If find() cannot find a match, it returns None.
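Calling a method such as get_text() on that None raises an AttributeError, so it is safest to check the result first. A minimal sketch (the <h7> tag is deliberately one that does not exist on the page):

tag = bs.find('h7')
if tag is None:
    print('Tag was not found')
else:
    print(tag.get_text())
Tag was not found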

BeautifulSoup find_all()

The find_all() function in BeautifulSoup finds every matching tag and returns them in a list.

find_all(name, attrs, recursive, string, limit, **kwargs)

The signature of find_all() is very similar to that of find(); the only difference is the extra limit argument. You can cap the number of results with limit.
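For example, a sketch that stops after the first three 'h2' matches, returned in document order:

first_three = bs.find_all('h2', limit=3)
for h in first_three:
    print(h.get_text())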

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://pythonscraping.com/blog')
bs = BeautifulSoup(html, 'html.parser')

# Collect every <h2> tag on the page into a list
h2 = bs.find_all('h2')
print(h2)
[<h2>Buy WsWP from O'Reilly</h2>, <h2>Navigation</h2>, <h2 class="element-invisible">You are here</h2>, <h2>
<a href="/blog/second-edition-changes">Second Edition Is Out!</a>
</h2>, <h2>
<a href="/blog/second-edition">Second Edition Coming this Fall!</a>
</h2>, <h2>
<a href="/blog/xpath-and-scrapy">XPath for Crawling with Scrapy</a>
</h2>, <h2>
<a href="/blog/selenium-headers">Selenium Headers</a>
</h2>, <h2>
<a href="/blog/tos-and-robots">Terms of Service and Robots.txt</a>
</h2>, <h2>
<a href="/blog/javascript">Scraping with JavaScript</a>
</h2>]

The output we get is a list of matching tags.

We can get the text out of the HTML tags by calling get_text() on each element of the list.

for h in h2:
    print(h.get_text())
Buy WsWP from O'Reilly
Navigation
You are here
Second Edition Is Out!
Second Edition Coming this Fall!
XPath for Crawling with Scrapy
Selenium Headers
Terms of Service and Robots.txt
Scraping with JavaScript

Finally, we get the text of every 'h2' tag, one heading per line.

Scraping the Title Using an Attribute

Suppose you want to scrape the title of the web page, but there are many 'h1' tags on it. You can single out the right 'h1' using its class attribute.

Example:

h1 = bs.find_all('h1', class_="title")
print(h1[0].get_text())
Posts to Scrape

Because class is a reserved keyword in Python and cannot be used as a variable or argument name, BeautifulSoup accepts the attribute as class_ instead of class.
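Since find_all() also accepts attribute filters as plain keyword arguments, here is a sketch of the same lookup by id (the id value is taken from the page markup visible in the next example's output):

h1 = bs.find_all('h1', id='page-title')
print(h1[0].get_text())
Posts to Scrape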

Multiple Tags in find_all()

We can also pass a list of the tags we want to scrape.

Example:

# Find every <h1> and <p> tag in a single pass
content = bs.find_all(['h1', 'p'])
print(content)
[<h1 class="title" id="page-title">
                  Posts to Scrape                
</h1>, <p>Well, the second edition has been out for a few months now, but the nice thing about ...</p>, <p>Four new chapters:</p>, ...]

The output above is truncated; run the code on your machine to see the full output.

Learn more about handling errors when web scraping with Beautiful Soup in the BS4 Documentation.
