Web-Scraping Techniques

Content Grabber makes it easy to extract data from most websites without requiring much prior knowledge of web-scraping techniques. However, you'll be able to build better web-scraping agents if you know some basic techniques. Some particularly difficult websites do require in-depth knowledge, and this user guide can help you build that understanding and direct you to additional resources.

The following topics are important if you want to become proficient at web scraping, but they are not necessarily a prerequisite for using Content Grabber successfully on every website. Click the links to learn more about each topic:

HTML Content - Web pages are driven by HTML, which is the basic language for building websites.

Dynamic Websites - Extracting data from dynamic websites can be challenging, because their content is often loaded or changed after the initial page load. Most dynamic websites rely on JavaScript, so it helps to have a general understanding of how JavaScript works.

XPath and Selection Techniques - Most web-scraping tools extract data from a website by selecting web elements on the web page. XPath is a query language for describing which elements to select.

Regular Expressions - XPath can select a web element such as a paragraph of text, but you may be interested in only a small part of that element's content. Regular expressions are a pattern language for extracting small pieces of text from a larger text element.
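
To make the last two topics concrete, here is a minimal sketch in Python of how an XPath-style selection and a regular expression can work together. It is not how Content Grabber itself is configured; the page fragment, the class names, and the function name extract_prices are all hypothetical and exist only for illustration. The sketch uses Python's standard library, whose ElementTree module supports only a limited subset of XPath and requires well-formed markup, so real-world HTML would usually need a more tolerant parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """
<html>
  <body>
    <div class="product">
      <h2>Blue Widget</h2>
      <p class="price">Price: $19.99 (in stock)</p>
    </div>
    <div class="product">
      <h2>Red Widget</h2>
      <p class="price">Price: $5.50 (2 left)</p>
    </div>
  </body>
</html>
"""


def extract_prices(page):
    """Return (name, price) pairs found in the page fragment."""
    root = ET.fromstring(page)
    results = []
    # Selection step: an XPath-style expression picks every product block.
    for product in root.findall(".//div[@class='product']"):
        name = product.findtext("h2")
        # The selected element holds more text than we want ...
        price_text = product.findtext("p[@class='price']")
        # ... so a regular expression keeps only the numeric amount.
        match = re.search(r"\$(\d+\.\d{2})", price_text or "")
        if match:
            results.append((name, float(match.group(1))))
    return results


if __name__ == "__main__":
    print(extract_prices(PAGE))
```

With the sample fragment above, the selection step returns the two product blocks, and the regular expression reduces "Price: $19.99 (in stock)" to the number 19.99. The same division of labor applies in any scraping tool: the selection narrows the page down to an element, and the pattern narrows the element's text down to the value you need.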