Web Scraping Limitations

Web-scraping can be challenging if you want to mine data from complex, dynamic websites. If you're new to web-scraping, then we recommend that you begin with an easy website: one that is mostly static and has little, if any, AJAX or JavaScript.

After you get familiar with the navigation paths for your target website, you need to identify a good start URL. Sometimes this is simply the home page of the website, but often the best start URL points to a sub-page, such as a product listing. Once you have this URL, copy it and paste it into the address bar of Content Grabber.
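
A quick way to verify that a candidate start URL is a good, mostly static entry point is to fetch it with a plain HTTP request (no JavaScript) and check whether the content you care about is already present in the raw HTML. Here is a minimal Python sketch, using a hypothetical URL and marker text:

    import requests

    # Hypothetical candidate start URL (a product listing sub-page).
    url = "https://example.com/products?page=1"

    html = requests.get(url, timeout=30).text

    # If the marker appears in the raw HTML, the page is served statically
    # enough to make a good start URL; if not, the content is probably
    # rendered by JavaScript and the site will be harder to scrape.
    if "product-list" in html:
        print("Static content found; good start URL candidate.")
    else:
        print("Content likely rendered client-side; expect a harder scrape.")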

NOTE: Some websites allow navigation without any corresponding change in the visible URL. In such cases, you may not have a start URL that points directly to your start webpage, and so you’ll need to add preliminary steps to your agent to navigate to that webpage.
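
Content Grabber records such preliminary navigation steps in its agent editor. As a generic illustration of the same idea, the following Python sketch uses Selenium to click through to a start page whose URL never changes; the link text and selector are hypothetical and would need to match your target website:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # The site keeps a single visible URL while the content changes, so we
    # reach the product listing by clicking, not by loading a deep link.
    driver = webdriver.Chrome()
    driver.get("https://example.com/")  # the only stable URL the site exposes

    # Preliminary navigation steps, equivalent to the steps you would add
    # to your agent before extraction begins.
    driver.find_element(By.LINK_TEXT, "Products").click()  # hypothetical link
    driver.find_element(By.CSS_SELECTOR, "#electronics").click()  # hypothetical selector

    # The browser is now on the page we actually want to scrape, even though
    # driver.current_url may still show the original address.
    html = driver.page_source
    driver.quit()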

Web-scraping can also be challenging if you don't have the proper tools. You're largely at the mercy of the target website, and that website can change at any time, without notice. It may contain faulty JavaScript that causes it to crash or exhibit surprising behavior. The server that hosts the website may go down, or the website may undergo maintenance. Many potential problems can occur during a lengthy web-scraping session, and you have very little influence over any of them. Content Grabber offers an array of advanced error-handling and stability features that can help you manage many of the problems a web-scraping agent is likely to encounter.
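
Content Grabber's error-handling features are configured within the application itself. As a rough sketch of the kind of resilience involved, here is a plain Python retry loop with exponential backoff around a page download; the attempt count and backoff factor are arbitrary assumptions:

    import time
    import requests

    def fetch_with_retries(url, attempts=5, backoff=2.0):
        """Fetch a page, retrying transient failures with exponential backoff."""
        for attempt in range(attempts):
            try:
                response = requests.get(url, timeout=30)
                response.raise_for_status()
                return response.text
            except requests.RequestException as exc:
                # Server crashes, maintenance windows, and timeouts all land here.
                wait = backoff ** attempt
                print(f"Attempt {attempt + 1} failed ({exc}); retrying in {wait:.0f}s")
                time.sleep(wait)
        raise RuntimeError(f"Giving up on {url} after {attempts} attempts")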

In addition to unreliable websites, another challenge is that some web-scraping tasks are especially difficult to complete, including the following:

Extracting data from complex websites

Extracting data from websites that use deterrents

Extracting huge amounts of data

Extracting data from non-HTML content

 

Extracting Data From Complex Websites

If you are developing web-scraping agents for a large number of different websites, you will probably find that around 50% of the websites are very easy to work with, 30% are modest in difficulty, and 20% are very challenging. For a small percentage of those challenging websites, it will be effectively impossible to extract meaningful data. It may take two weeks or more for a web-scraping expert to develop an agent for such a website, so the cost of developing the agent is likely to outweigh the value of the data you might be able to extract.

Extracting Data From Websites Using Deterrents

Web-scraping will always be challenging for any website with active deterrents in place. If you must log in to access the content that you want to extract, then the website can always cancel your account and make it impractical to create new accounts.

Some websites use browser fingerprinting to identify you and block your access. Fingerprinting uses JavaScript to examine your browser and computer specifications and make a positive identification, which makes it very difficult, if not impossible, to circumvent.

Another method used by websites that are wary of crawlers or scrapers is CAPTCHA. Content Grabber includes tools you can use to overcome CAPTCHA protection, but you'll incur additional costs to have a 3rd-party service do automatic CAPTCHA processing. See CAPTCHA Blocking for more information.
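
Third-party CAPTCHA solving services typically expose an HTTP API: you upload the CAPTCHA image and then poll for the solved text. The Python sketch below is purely illustrative; the service domain, endpoints, parameters, and response format are all hypothetical placeholders, not a real API:

    import time
    import requests

    API_KEY = "your-api-key"  # issued by the (hypothetical) solving service

    def solve_captcha(image_bytes):
        """Submit a CAPTCHA image to a hypothetical solving service and poll for the answer."""
        upload = requests.post(
            "https://solver.example.com/api/submit",  # placeholder endpoint
            data={"key": API_KEY},
            files={"captcha": ("captcha.png", image_bytes)},
        )
        task_id = upload.json()["task_id"]

        # Solving takes a few seconds, so poll until the result is ready.
        while True:
            result = requests.get(
                "https://solver.example.com/api/result",  # placeholder endpoint
                params={"key": API_KEY, "task_id": task_id},
            )
            body = result.json()
            if body["status"] == "done":
                return body["text"]
            time.sleep(5)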

 

The most common protection technique is using your IP address to identify and block your access to a website. You can usually circumvent this technique with a proxy rotation service, which hides your actual IP address by using a new IP address every time you request a web page. See IP Blocking & Proxy Servers for more information.
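
Content Grabber manages proxy rotation through its own settings. As a generic illustration of the technique, this Python sketch cycles through a pool of proxy addresses (placeholders here) so that successive requests originate from different IP addresses:

    import itertools
    import requests

    # Placeholder proxy addresses; a rotation service would supply real ones.
    PROXIES = [
        "http://proxy1.example.com:8080",
        "http://proxy2.example.com:8080",
        "http://proxy3.example.com:8080",
    ]
    proxy_cycle = itertools.cycle(PROXIES)

    def fetch_via_rotating_proxy(url):
        """Send each request through the next proxy in the pool."""
        proxy = next(proxy_cycle)
        return requests.get(
            url,
            proxies={"http": proxy, "https": proxy},
            timeout=30,
        )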

NOTE: Ethically and legally, we recommend that you avoid websites that are actively taking measures to block your access, even if you are able to circumvent the protection.

 

Extracting Huge Amounts of Data

A web-scraping tool must actually visit a web page to extract data from it. Downloading a web page takes time, and it could take weeks or even months to load and extract data from millions of web pages. For example, it's virtually impossible to extract all product data from Amazon.com, simply because there are so many product pages.
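
A quick back-of-envelope calculation makes the scale concrete. Assuming, purely for illustration, an average of two seconds per page and twenty concurrent downloads, ten million pages still take well over a week:

    # Rough throughput estimate; all numbers are illustrative assumptions.
    pages = 10_000_000        # pages to download
    seconds_per_page = 2.0    # average download + extraction time
    concurrency = 20          # simultaneous connections

    total_seconds = pages * seconds_per_page / concurrency
    print(f"{total_seconds / 86_400:.1f} days")  # -> 11.6 days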

 

Extracting Data From Non-HTML Content

Some websites are built entirely in Flash, a small-footprint application that runs inside the web browser. Content Grabber can only work with HTML content, so while it can download the Flash file itself, it can't interact with the Flash application or extract data from within it.

Many websites provide data in the form of PDF files and other file formats. Though Content Grabber cannot extract data from such files directly, it can download them, convert them into HTML documents using 3rd-party converters, and then extract data from the conversion output. The conversion happens quickly, in real time, so it feels like a direct extraction. Keep in mind, however, that PDF documents and most other file formats don't contain content that converts easily into structured HTML, so you will often need the Regular Expressions feature of Content Grabber to pick the data you want out of the conversion output.
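
As a generic illustration of this pipeline, the Python sketch below downloads a PDF, converts it to text with the pdftotext command-line tool (part of Poppler, which must be installed separately), and applies a regular expression to recover structured fields; the URL and the pattern are hypothetical:

    import re
    import subprocess
    import requests

    # Hypothetical document URL.
    pdf_bytes = requests.get("https://example.com/pricelist.pdf", timeout=30).content
    with open("pricelist.pdf", "wb") as f:
        f.write(pdf_bytes)

    # Convert with pdftotext (from Poppler); "-" writes the text to stdout.
    text = subprocess.run(
        ["pdftotext", "-layout", "pricelist.pdf", "-"],
        capture_output=True, text=True, check=True,
    ).stdout

    # The converted text is unstructured, so a regular expression recovers
    # the fields; this pattern assumes lines like "Widget A    $19.99".
    for name, price in re.findall(r"^(\w[\w ]*?)\s+\$(\d+\.\d{2})$", text, re.MULTILINE):
        print(name, price)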