Selection Techniques

Mastering the essential selection techniques is critical to web scraping. When you point and click on content in the web browser to create agent commands, you are using the most basic selection technique. Beyond simple point-and-click, Content Grabber provides a range of tools to help you make precise selections, including:

XPath Editor - gives you the ability to manipulate the selection XPath

Tree View Window - gives a more precise view of a web page

List Tool - helps you select a list of web page elements

Anchor Tool - helps you apply conditions to the selection XPath

 

You have to be explicit with Content Grabber: be deliberate and specific when selecting each HTML element that you want to capture. This can be confusing, for example when you want to capture the entries of a search-result listing and each entry has a heading. Some of the headings may be links and others plain text, so the HTML for the headings may look something like this:

 

<h1><a href="http://website.com">Heading as a link</a></h1>

<h1><a href="http://website.com">Heading as a link</a></h1>

<h1><span>Heading as plain text</span></h1>

<h1><a href="http://website.com">Heading as a link</a></h1>

 

To extract the text of all the headings in this example, you might use the first two headings to create a list selection. Because the first two headings are links, only the link headings will be chosen, not the heading that is plain text. Content Grabber does not somehow know that you want to extract all the headings, and by default it assumes that you are trying to select only the links. In such a case, change the selection so that it selects the <h1> tag instead of the link tag before you create the list.
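The difference between the two selections can be illustrated outside Content Grabber. The following is a minimal sketch using Python's standard xml.etree.ElementTree module, not Content Grabber's own XPath engine. The wrapping <div> element, the helper name heading_texts, and the two paths are assumptions made for illustration; the XPath that Content Grabber generates for a real page will likely look different.

```python
import xml.etree.ElementTree as ET

# The four headings from the example, wrapped in a hypothetical <div> root
# so that the fragment parses as well-formed XML.
HTML = """
<div>
<h1><a href="http://website.com">Heading as a link</a></h1>
<h1><a href="http://website.com">Heading as a link</a></h1>
<h1><span>Heading as plain text</span></h1>
<h1><a href="http://website.com">Heading as a link</a></h1>
</div>
"""

root = ET.fromstring(HTML.strip())


def heading_texts(path):
    """Return the text of every element matched by an ElementTree path."""
    return ["".join(el.itertext()).strip() for el in root.findall(path)]


# Selection based on the link tag: the plain-text heading is skipped.
link_only = heading_texts(".//h1/a")

# Selection based on the <h1> tag: every heading is matched.
all_headings = heading_texts(".//h1")

print(len(link_only), link_only)
print(len(all_headings), all_headings)
```

The path that ends at the link tag matches three headings, while the path that ends at the <h1> tag matches all four, which is why the selection should be moved to the <h1> tag before the list is created.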