Web scraping means collecting data from websites. A website's structure depends entirely on its developer and code, and the markup on a site we want to scrape may well be poorly formatted. Error handling is therefore essential for dealing with those situations. In this blog about Python web scraping using Beautiful Soup, we are going to learn about error handling.
Types of Error While Fetching a Website
Before scraping any website, we need to consider two situations: the page may not exist on the server (an HTTP error such as 404), or the server itself may be unreachable (a URL error).
These are the two main kinds of error we can run into when we fetch a website from the server.
urllib HTTPError Handling
In this section, we will try to handle an HTTP error, i.e., status code 404 (Page Not Found).
```python
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("http://pythonscraping.com/blog/second-edition-changes")
except HTTPError as e:
    print(e)
else:
    print("No Error")
```
In the above code, we are trying to fetch http://pythonscraping.com/blog/second-edition-changes. This page exists, so we do not get any error:
No Error
Now, let's change the page we are trying to fetch, e.g., http://pythonscraping.com/fetching/page-that-do-not-exist. It will be interesting to see what we get in the output.
HTTP Error 404: Not Found
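The HTTPError object carries more than the printed message: it also exposes the numeric status code and the reason text. A minimal sketch, constructing the error directly so it runs without a network request (urlopen raises an equivalent object for a real 404 response):

```python
from urllib.error import HTTPError

# Build an HTTPError by hand to inspect its attributes; the URL here
# is just a placeholder for illustration.
e = HTTPError("http://pythonscraping.com/no-such-page", 404, "Not Found",
              hdrs=None, fp=None)
print(e.code)    # 404
print(e.reason)  # Not Found
print(e)         # HTTP Error 404: Not Found
```

Checking `e.code` inside the `except` block lets us react to a 404 differently from, say, a 500.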
Urllib URLError Handling
In the previous section, we learned how to handle 404 HTTP errors using urllib's HTTPError. In this section, we will take a look at handling URLError, which is raised when the server itself cannot be reached at all. This can happen, for example, when we enter a wrong website URL: that failure falls under URLError in the urllib module.
```python
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen("http://pythonscrapingdontexist.com/blog")
except HTTPError as e:
    print(e)
except URLError as e:
    print("Website Can't be reached")
else:
    print("No Error")
```
Website Can't be reached
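Note that the order of the two except clauses above matters: HTTPError is a subclass of URLError, so a bare `except URLError` placed first would also swallow HTTP errors like 404. A quick check confirms the relationship:

```python
from urllib.error import HTTPError, URLError

# HTTPError inherits from URLError, so the more specific handler
# must come first in a try/except chain.
print(issubclass(HTTPError, URLError))  # True
```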
The above two sections discussed error handling when we are unable to fetch the website. But errors may also arise from the content of the scraped website itself. Let's discuss those errors.
Let's assume we successfully got the content of the website. Now it's time to scrape the content useful to us. We scrape data using tags. What if a tag is not available on the website?
When we try to access a tag in the HTML content using Beautiful Soup and the tag is not found, BeautifulSoup returns a None object, and accessing a tag on that None object throws an AttributeError.
When the find function is given a non-existent tag such as "nonExistentTag", it returns None, and trying to read a child tag from that None result fails with:
AttributeError: 'NoneType' object has no attribute 'sometag'
One thing to note: the older attribute-style access such as bs.nonExistentTag is deprecated. Use find(), which returns a None object you can check, to handle these errors.
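A minimal sketch of this behavior, using a small inline HTML snippet (the markup and the nonExistentTag/sometag names are made up for illustration):

```python
from bs4 import BeautifulSoup

html = "<html><body><h1>Title</h1></body></html>"  # stand-in document
bs = BeautifulSoup(html, "html.parser")

# find() returns None when the tag is absent
print(bs.find("nonExistentTag"))  # None

# Accessing a tag on that None object raises AttributeError
try:
    bs.find("nonExistentTag").sometag
except AttributeError as e:
    print(e)  # 'NoneType' object has no attribute 'sometag'
```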
Python Web Scraping Error Handling
In the above section, we took a brief look at the non-existent tag error and the None object. Now let's handle the web scraping tag error.
```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://pythonscraping.com/blog')
bs = BeautifulSoup(html, 'html.parser')

try:
    content = bs.find("table").parent()
except AttributeError as e:
    print("Tag Not Found :", e)
else:
    if content == None:
        print("Tag Not Found")
    else:
        print(content)
```
Tag Not Found : 'NoneType' object has no attribute 'parent'
If we try to access a tag on a None object, the AttributeError is caught; if we don't, control passes to the else block. There the content's type is checked: if it is None, the tag was not found; otherwise, the tag was found successfully.
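All of these checks can be combined into a single helper that never crashes, whether the page is missing, the server is unreachable, or an expected tag is absent. A sketch under that idea; `get_title` is a hypothetical helper, not part of any library:

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup


def get_title(url):
    """Return the page's first <h1> tag, or None if anything goes wrong."""
    try:
        html = urlopen(url)
    except (HTTPError, URLError):
        return None  # page missing or server unreachable
    try:
        bs = BeautifulSoup(html.read(), "html.parser")
        title = bs.body.h1
    except AttributeError:
        return None  # expected tags not present
    return title


title = get_title("http://pythonscraping.com/blog")
if title is None:
    print("Title could not be found")
else:
    print(title)
```

The caller only has to test for None once, instead of wrapping every fetch in its own try/except chain.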
Learn more about Python Web Scraping error handling using Beautiful Soup from the BS4 Documentation.