Python Web Scraping Error Handling – BeautifulSoup

Web scraping means collecting data from websites. A website's format depends on its developer and its code, and the site we want to scrape may be poorly formatted. Error handling is therefore very important for coping with those situations. In this post on Python web scraping with Beautiful Soup, we are going to learn about error handling.

Types of Error While Fetching a Website

Before scraping any website, we need to think about two situations that can arise with the website we are about to scrape:

Status Code 404 (Page Not Found)
The web page we want to scrape may not be available, in which case the web server presents us with a beautiful 404 status code.
Server Error (Status Code 500)
The server may also fail while handling our request, for example because something on the website is broken at the moment we try to scrape it.

These are the two main kinds of error we can hit when fetching a page from a server.
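
To make the two cases concrete, here is a minimal sketch that sorts a status code into a coarse category. The function name classify_status is made up for this illustration and is not part of urllib or Beautiful Soup.

```python
def classify_status(code: int) -> str:
    """Return a coarse label for an HTTP status code (illustrative helper)."""
    if 200 <= code < 300:
        return "success"
    if 400 <= code < 500:
        return "client error"   # e.g. 404 Page Not Found
    if 500 <= code < 600:
        return "server error"   # e.g. 500 Internal Server Error
    return "other"


for code in (200, 404, 500):
    print(code, classify_status(code))
```

A 404 falls in the 4xx range (the request was wrong, for example a bad path), while a 500 falls in the 5xx range (the server failed while handling the request).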

urllib HTTPError Handling

In this section, we will handle HTTP error codes such as status code 404 (Page Not Found).

from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("http://pythonscraping.com/blog/second-edition-changes")
except HTTPError as e:
    print(e)
else:
    print("No Error")

In the above code, we try to fetch http://pythonscraping.com/blog/second-edition-changes. This page exists, so we do not get any error:

Output:

No Error

Now, let's change the page we are trying to fetch to one that does not exist, e.g. http://pythonscraping.com/fetching/page-that-do-not-exist. It will be interesting to see what we get as output:

HTTP Error 404: Not Found
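
Printing the exception is often enough, but the HTTPError object also carries the status code and reason as attributes. The sketch below builds an HTTPError by hand so it runs without a network connection; the helper name describe is an assumption made for this example.

```python
import io
from urllib.error import HTTPError


def describe(error: HTTPError) -> str:
    """Format the status code and reason of an HTTPError (illustrative helper)."""
    return f"{error.code}: {error.reason}"


# Constructed manually for demonstration; urlopen normally raises this for us.
sample = HTTPError(
    "http://pythonscraping.com/fetching/page-that-do-not-exist",
    404,
    "Not Found",
    {},
    io.BytesIO(),
)

print(sample)            # same text as printing e in the except block
print(describe(sample))
```

Checking error.code lets a scraper react differently to a missing page (skip it) and a server error (perhaps retry later).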

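urllib URLError Handling

In the previous section, we learned how to handle 404 errors using urllib's HTTPError. A 500 Internal Server Error is reported the same way, as an HTTPError with code 500. In this section, we will look at what happens when the server cannot be reached at all.

This can happen when the server is down or when we enter the wrong URL for the website we are trying to scrape. urllib reports such failures as a URLError. Because HTTPError is a subclass of URLError, the except HTTPError clause has to come before the except URLError clause.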

from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen("http://pythonscrapingdontexist.com/blog")
except HTTPError as e:
    print(e)
except URLError as e:
    print("Website Can't be reached")
else:
    print("No Error")

Output:

Website Can't be reached
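
The same try/except pattern can be wrapped in a small reusable function. This is only a sketch: the name fetch and its opener parameter are assumptions added for illustration (the parameter lets us swap in a fake opener for testing), not part of urllib.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def fetch(url, opener=urlopen):
    """Return the response for url, or None if it could not be fetched."""
    try:
        return opener(url)
    except HTTPError as e:
        # Must be listed first: HTTPError is a subclass of URLError.
        print("Server returned an error:", e)
    except URLError as e:
        print("Website can't be reached:", e.reason)
    return None
```

A caller can then write response = fetch("http://pythonscraping.com/blog") and simply check whether response is None before parsing it.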

The two sections above covered error handling when we are not able to fetch the website. Errors can also come from the content of the scraped page itself. Let's discuss those errors.

NonExistentTag Error

Let's assume we successfully got the content of the website. Now it's time to scrape the content that is useful to us. We scrape data using tags. But what if the tag is not present on the page?

When we try to access a tag in the HTML content using Beautiful Soup and the tag is not found, Beautiful Soup returns None. Accessing a tag on that None object then raises an AttributeError.

print(bs.find("nonExistent"))

When the find function is given the "nonExistent" tag, it returns None.

print(bs.find("nonExistent").sometag)
AttributeError: 'NoneType' object has no attribute 'sometag'

One thing to note: the attribute form bs.nonExistentTag is deprecated in Beautiful Soup 4. Use find as shown above to get the None object and handle the error.
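
The behaviour described above can be tried without fetching anything by parsing a small inline HTML string. The markup here is made up for the example.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page, used instead of a real download.
html = "<html><body><h1>Hello</h1></body></html>"
bs = BeautifulSoup(html, "html.parser")

missing = bs.find("nonExistent")
print(missing)  # find() gives None when no such tag exists

heading = bs.find("h1")
if heading is not None:
    # Only touch the tag after confirming it was found.
    print(heading.get_text())
```

Checking for None before using the result avoids the AttributeError entirely, which is often simpler than catching it afterwards.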

Python Web Scraping Error Handling

In the above section, we took a brief look at the non-existent tag error and the None object. Now let's handle the web scraping tag error.

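from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://pythonscraping.com/blog')
bs = BeautifulSoup(html, 'html.parser')
try:
    # find() returns None if there is no <table>; reading .parent on None
    # raises AttributeError. Note that parent is an attribute, not a method.
    content = bs.find("table").parent
except AttributeError as e:
    print("Tag Not Found :", e)
else:
    # Defensive check in case the lookup itself produced None.
    if content is None:
        print("Tag Not Found")
    else:
        print(content)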

Output:

Tag Not Found : 'NoneType' object has no attribute 'parent'

If find returns None, accessing parent on it raises an AttributeError, which the except block catches. If no exception is raised, control passes to the else block, which checks whether content is None: if it is, the tag was not found; otherwise the tag was found successfully and is printed.
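
As an alternative to catching AttributeError, the lookup can be split into steps with an explicit None check. The sketch below uses a hypothetical helper, parent_of, and a small inline HTML string so that it runs without a network connection.

```python
from bs4 import BeautifulSoup


def parent_of(bs, tag_name):
    """Return the parent of the first tag_name tag, or None if it is missing."""
    tag = bs.find(tag_name)
    if tag is None:
        return None
    return tag.parent


# Hand-written markup for demonstration only.
bs = BeautifulSoup("<div><table><tr><td>1</td></tr></table></div>", "html.parser")

found = parent_of(bs, "table")
absent = parent_of(bs, "ul")

print(found.name if found is not None else "Tag Not Found")
print(absent.name if absent is not None else "Tag Not Found")
```

Both styles work. The explicit check makes it clear which step of a chained lookup failed, while try/except keeps the chained expression on one line.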

Learn more about Python Web Scraping error handling using Beautiful Soup from the BS4 Documentation.
