After you get familiar with the navigation paths for your target website, you need to identify a good start URL. Sometimes this is simply the home page of the website, but often the best URL is that of a sub-page, such as a product listing. Once you have this URL, copy it and paste it into the address bar of Content Grabber.
In addition to unreliable websites, another challenge is that some web-scraping tasks are especially difficult to complete, including the following:
• Extracting data from complex websites
• Extracting data from websites that use deterrents
• Extracting huge amounts of data
• Extracting data from non-HTML content
If you are developing web-scraping agents for a large number of different websites, you will probably find that around 50% of the websites are very easy, 30% are moderately difficult, and 20% are very challenging. For a small percentage, it will be effectively impossible to extract meaningful data. It may take two weeks or more for a web-scraping expert to develop an agent for such a website, so the cost of developing the agent is likely to outweigh the value of the data you might be able to extract.
Web-scraping will always be challenging for any website with active deterrents in place. If you must log in to access the content that you want to extract, the website can always cancel your account and make it impractical for you to create new accounts.
Another deterrent used by websites that are wary of crawlers or scrapers is CAPTCHA. Content Grabber includes tools you can use to overcome CAPTCHA protection, but you'll incur additional costs for a third-party service to process the CAPTCHAs automatically. See CAPTCHA Blocking for more information.
The most common protection technique is using your IP address to identify and block your access to a website. You can usually circumvent this technique by using a proxy rotation service, which hides your actual IP address and uses a new IP address every time you request a web page from a website. See IP Blocking & Proxy Servers for more information.
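The rotation idea described above can be illustrated with a minimal sketch. It is not how Content Grabber or any particular proxy service works internally: the proxy addresses, the example URLs, and the function name assign_proxies are all assumptions made for illustration, and no network requests are sent. The sketch only shows the round-robin principle, in which each page request is paired with the next proxy in the list.

```python
from itertools import cycle

# Hypothetical proxy endpoints; a real rotation service supplies its own.
PROXIES = [
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
    "http://proxy-c.example:8080",
]


def assign_proxies(urls, proxies):
    """Pair each URL with the next proxy in round-robin order."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    rotation = cycle(proxies)  # endlessly repeats the proxy list in order
    return [(url, next(rotation)) for url in urls]


# Hypothetical page URLs, used only to show the pairing.
pages = [f"https://shop.example/products?page={n}" for n in range(1, 6)]
plan = assign_proxies(pages, PROXIES)

for url, proxy in plan:
    print(f"{proxy} -> {url}")
```

With three proxies and five pages, the fourth page is paired with the first proxy again. Commercial rotation services typically draw from a far larger address pool, so the same address is reused much less often.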
NOTE: Ethically and legally, we recommend that you avoid websites that are actively taking measures to block your access, even if you are able to circumvent the protection.
A web-scraping tool must actually visit a web page to extract data from it. Downloading a web page takes time, so it could take weeks or even months to load and extract data from millions of web pages. For example, it's virtually impossible to extract all product data from Amazon.com, because there are simply too many web pages.
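A rough back-of-the-envelope calculation shows why volume becomes a problem. The sketch below is only an estimate under stated assumptions: the page count, the two seconds per page, and the number of parallel sessions are illustrative figures, not measurements of any real website or of Content Grabber, and crawl_days is a hypothetical helper.

```python
SECONDS_PER_DAY = 86_400


def crawl_days(page_count, seconds_per_page, parallel_sessions=1):
    """Estimate how many days it takes to load page_count pages.

    Assumes every page takes the same time and that parallel
    sessions divide the work evenly, which is optimistic.
    """
    total_seconds = page_count * seconds_per_page / parallel_sessions
    return total_seconds / SECONDS_PER_DAY


# Illustrative figures: 5 million pages at 2 seconds per page.
print(round(crawl_days(5_000_000, 2.0), 1))                        # 115.7
print(round(crawl_days(5_000_000, 2.0, parallel_sessions=10), 1))  # 11.6
```

Even with ten sessions running in parallel, the illustrative crawl takes more than a week and a half, and this ignores retries, throttling, and any blocking by the website.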
Some websites are built entirely in Flash, where the whole site is a small-footprint software application that runs in the web browser. Content Grabber can only work with HTML content, so it can download the Flash file itself, but it can't interact with the Flash application or extract data from within it.
Many websites provide data in the form of PDF files and other file formats. Though it cannot directly extract data from such files, Content Grabber can easily download them, convert them into HTML documents using third-party converters, and then extract data from the conversion output. The document conversion happens very quickly in real time, so it will seem as though you are performing a direct extraction. It's important to realize, however, that PDF documents and most other file formats don't contain content that is easily convertible into structured HTML. To pick out individual values from the loosely structured conversion output, you can use the Regular Expressions feature of Content Grabber.
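To illustrate how regular expressions can recover values from loosely structured conversion output, here is a minimal sketch. It uses Python's standard re module, not Content Grabber's own Regular Expressions feature, whose syntax and options may differ. The sample text, the field names, and the extract_fields helper are all assumptions made up for this example.

```python
import re

# Hypothetical text, as it might look after a PDF invoice has been
# converted; this is not real converter output.
converted_text = """
Invoice No: INV-1042
Date: 2023-05-17
Total Due: $1,284.50
"""

# One pattern per field; the capture group holds the value to keep.
PATTERNS = {
    "invoice_no": re.compile(r"Invoice No:\s*(\S+)"),
    "date": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})"),
    "total_due": re.compile(r"Total Due:\s*\$([\d,]+\.\d{2})"),
}


def extract_fields(text, patterns):
    """Return the first captured value per field, or None if absent."""
    record = {}
    for name, pattern in patterns.items():
        match = pattern.search(text)
        record[name] = match.group(1) if match else None
    return record


record = extract_fields(converted_text, PATTERNS)
print(record)
```

Fields that are missing from the text come back as None rather than raising an error, which makes it easier to spot documents whose layout differs from the one the patterns were written for.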