Web scraping means collecting data from websites. A website’s structure depends on its developer and the underlying code, so the page we want to scrape may be poorly formatted, or may not be reachable at all. Error handling is therefore essential for dealing with these situations. In this post on Python web scraping with Beautiful Soup, we will learn how to handle the most common errors.
Types of Error While Fetching a Website
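Before scraping any website, we need to consider two situations that can arise when we fetch it from the server: the server responds but the requested page cannot be returned (for example, the page is not found), or the server cannot be reached at all.
These are the two main kinds of error we can get when fetching a website from the server.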
urllib HTTPError Handling
In this section, we will handle HTTP error codes, for example status code 404 (Page Not Found).
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("http://pythonscraping.com/blog/second-edition-changes")
except HTTPError as e:
    print(e)
else:
    print("No Error")
In the code above, we fetch http://pythonscraping.com/blog/second-edition-changes. The page exists, so no error is raised:
Output:
No Error
Now let’s change the URL to a page that does not exist, for example http://pythonscraping.com/fetching/page-that-do-not-exist, and see what the output is.
Output:
HTTP Error 404: Not Found
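Besides printing the exception, the HTTPError object also exposes the numeric status as e.code and a short description as e.reason, which lets a scraper react differently to different failures. The following is only a minimal sketch: the helper name describe_http_error and its messages are hypothetical, and the errors are built by hand so that the example runs without network access.

```python
import io
from email.message import Message
from urllib.error import HTTPError


def describe_http_error(error):
    """Return a short description of an HTTPError (hypothetical helper)."""
    if error.code == 404:
        return "Page not found (404)"
    if 500 <= error.code < 600:
        return "Server-side error ({}): {}".format(error.code, error.reason)
    return "HTTP error {}: {}".format(error.code, error.reason)


# Build the errors by hand (illustration only) instead of hitting the network.
missing = HTTPError(
    "http://pythonscraping.com/fetching/page-that-do-not-exist",
    404, "Not Found", Message(), io.BytesIO(b""),
)
broken = HTTPError(
    "http://pythonscraping.com/blog",
    500, "Internal Server Error", Message(), io.BytesIO(b""),
)

print(describe_http_error(missing))
print(describe_http_error(broken))
```

In a real scraper, the same helper could be called inside the except HTTPError block shown above.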
urllib URLError Handling
In the previous section, we learned how to handle HTTP errors such as 404 using urllib’s HTTPError. A server error such as 500 Internal Server Error is also an HTTP response, so it is raised as an HTTPError too. In this section, we will look at how to handle the case where the server cannot be reached at all.
This can happen when we enter the wrong URL for the website we are trying to scrape, or when the server is down. Both cases are raised as a URLError from urllib.
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen("http://pythonscrapingdontexist.com/blog")
except HTTPError as e:
    print(e)
except URLError as e:
    print("Website Can't be reached")
else:
    print("No Error")
Output:
Website Can't be reached
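The order of the except clauses above matters because HTTPError is a subclass of URLError: if URLError were listed first, it would also catch every HTTP error. The two handlers can be wrapped in a small function, sketched below. The name open_or_none and the injectable opener parameter are assumptions made for this example (so that it can run without a network connection); they are not part of urllib.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def open_or_none(url, opener=urlopen):
    """Return the response for url, or None if it cannot be fetched.

    opener is injectable (an assumption of this sketch) so the function
    can be exercised without real network access.
    """
    try:
        return opener(url)
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        print("HTTP error:", e.code)
    except URLError as e:
        # No usable answer at all: bad host name, refused connection, etc.
        print("Website can't be reached:", e.reason)
    return None


def unreachable(url):
    """Stand-in opener that always fails the way an unknown host would."""
    raise URLError("name resolution failed")


result = open_or_none("http://pythonscrapingdontexist.com/blog", opener=unreachable)
print(result)
```

Calling open_or_none(url) without the opener argument uses the real urlopen.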
The two sections above covered error handling when we are unable to fetch the website. Errors can also come from the content of a page we fetched successfully. Let’s discuss those errors.
NonExistentTag Error
Let’s assume we successfully fetched the content of the website. Now it’s time to extract the content that is useful to us, which we do by looking up tags. But what if a tag is not present on the page?
When we try to access a tag in the HTML content using Beautiful Soup and the tag is not found, Beautiful Soup returns None. Accessing a tag on that None object then raises an AttributeError.
print(bs.find("nonExistent"))
When the find function is given the "nonExistent" tag, it returns None.
print(bs.find("nonExistent").sometag)
AttributeError: 'NoneType' object has no attribute 'sometag'
One thing to note: the attribute-style lookup bs.nonExistentTag is deprecated. Use bs.find() as shown above to get the None object, and handle the error from there.
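The None behaviour described above can be tried without fetching anything by parsing a small inline document. This is a minimal sketch: the HTML string is made up for illustration, and it assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# A tiny inline document (made up for illustration) with no <table> in it.
html = "<html><body><h1>Title</h1><p>Some text</p></body></html>"
bs = BeautifulSoup(html, "html.parser")

heading = bs.find("h1")    # present: a Tag object
table = bs.find("table")   # absent: None

# Check for None before touching attributes of the result.
heading_text = heading.get_text() if heading is not None else None
table_text = table.get_text() if table is not None else None

print(heading_text)
print(table_text)
```

Checking for None first avoids the AttributeError entirely, which can be easier to read than a try/except when only one lookup is involved.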
Python Web Scraping Error Handling
In the section above, we took a brief look at the non-existent tag error and the None object. Now let’s handle the tag error in a scraper.
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://pythonscraping.com/blog')
bs = BeautifulSoup(html, 'html.parser')

try:
    # find() returns None if there is no <table>; .parent then raises AttributeError
    content = bs.find("table").parent
except AttributeError as e:
    print("Tag Not Found :", e)
else:
    if content is None:
        print("Tag Not Found")
    else:
        print(content)
Output:
Tag Not Found : 'NoneType' object has no attribute 'parent'
If we try to access a tag on a None object, the AttributeError is caught by the except block. If no AttributeError is raised, control passes to the else block, where content is checked: if it is None, the tag was not found; otherwise the tag was found successfully.
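The fetch errors and the tag errors can be handled together in one reusable function. The sketch below is one possible arrangement rather than a definitive implementation: the function name get_title, the choice of the first h1 inside body, the injectable opener parameter, and the fake_opener stand-in are all assumptions made for this example, and the bs4 package is assumed to be installed.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

from bs4 import BeautifulSoup


def get_title(url, opener=urlopen):
    """Return the text of the first <h1> inside <body>, or None on any failure."""
    try:
        html = opener(url)
    except (HTTPError, URLError):
        return None  # page missing or server unreachable
    try:
        bs = BeautifulSoup(html, "html.parser")
        title = bs.body.h1  # bs.body is None when there is no <body>
    except AttributeError:
        return None
    return title.get_text() if title is not None else None


def fake_opener(url):
    """Stand-in for urlopen (illustration only): returns fixed HTML bytes."""
    return b"<html><body><h1>Example heading</h1></body></html>"


title = get_title("http://example.invalid/page", opener=fake_opener)
if title is None:
    print("Title could not be found")
else:
    print(title)
```

The caller only has to check for None, regardless of whether the failure happened while fetching the page or while looking up the tag.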
Learn more about Python Web Scraping error handling using Beautiful Soup from the BS4 Documentation.