In the previous section, we selected our Start URL and loaded the web page into Content Grabber. Next, we can select the data we want to capture and start building our web-scraping agent. In our Cruise Direct example, we plan to search for available cruise vacations and then extract the cruise name, the cruise line, the destination, the departing place, the ports of call, and the different prices.
1. First, we need to perform a search to retrieve the data for the available cruises. To do this, select the orange Search button element with the mouse, then click it one more time to display the Content Grabber Message window.
Content Grabber Message Window
2. From the message window, choose the Click on the Web Element option to add a new command to the agent; this command executes the search and displays the search results on a new web page. Notice that Content Grabber has added our first command to the Agent Explorer.
Agent Explorer with new Search command linked to new Search page
3. We are now ready to add commands to our agent to extract the cruise data. Because the data elements appear in repeated tables, we will use a list to simplify the extraction. To capture a data element, move the mouse precisely over the element you want until the data-capture box appears around it. We start by selecting the first cruise name.
First Cruise Line data element selected within Content Grabber
4. Then click List in the Configure Agent Command panel to activate list selection mode.
Activating List selection mode from the Configure Agent Command panel
5. In list selection mode, we can add web data elements to the list by clicking similar data elements. Click the second cruise name, and Content Grabber selects the remaining matching data elements on the page. Note: if any cruise data elements remain unselected, simply click them to add them to the list.
Second Cruise Line data element selected while in List selection mode
6. We now click Save to save the list and exit list selection mode. The Web Element list command defines the list area, so any elements within this area are now included in the list.
7. To capture the cruise name text, click any selected element to display the Content Grabber Message window, then choose the Capture Text option to add a web element command that captures the cruise names. We have now added new web element list and web element commands to the Agent Explorer, and Content Grabber has set default names for these commands.
8. To edit the names of the commands, click their respective Edit icons and set the names to ‘Search List’ and ‘Cruise Name’. Then click the Green Tick to save.
Agent Explorer with new Search List and Cruise Name commands
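Content Grabber builds these list and capture commands visually, so no code is needed. Purely as an illustration of what a Web Element list plus Capture Text pairing does, here is a minimal Python sketch using only the standard library; the HTML snippet and the use of `<h3>` for cruise names are invented for the example, not taken from the Cruise Direct site.

```python
# Illustrative sketch: extract every "cruise name" from repeated list items,
# mimicking a Web Element list + Capture Text command pair.
# The sample HTML and tag choice (h3) are hypothetical.
from html.parser import HTMLParser

SAMPLE = """
<div class="result"><h3>7 Night Caribbean</h3></div>
<div class="result"><h3>5 Night Bahamas</h3></div>
"""

class CruiseNameParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_name = False   # True while inside an <h3> element
        self.names = []        # collected cruise names

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_name = True

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_name = False

    def handle_data(self, data):
        if self.in_name and data.strip():
            self.names.append(data.strip())

p = CruiseNameParser()
p.feed(SAMPLE)
print(p.names)  # ['7 Night Caribbean', '5 Night Bahamas']
```

The key idea matches the GUI workflow: the list is defined once over the repeated elements, and a single capture rule then applies to every member of the list.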
9. Now we plan to extract the individual cruise data elements from each table. First, click the cruise line data element. Content Grabber automatically selects all the cruise line elements, because the list area is already defined.
10. Next, click the cruise line data element one more time to display the Content Grabber Message window, then choose the Capture Text option to add a command that captures the individual cruise line values.
11. After that, click the Edit icon to rename the command to ‘Cruise Line’, then save it.
12. Now we do the same for the Destination and Departing From data elements, then set the respective names of the commands to ‘Destination’ and ‘Departing From’ and save them.
Agent Explorer with new Capture Text commands
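Again, the Agent handles this visually; as an analogy only, the steps above amount to extracting several named fields from each list item. The sketch below (standard-library Python, with invented HTML and class names such as `line`, `dest`, and `from`) shows the same idea of one record per list item:

```python
# Illustrative sketch: capture several fields (cruise line, destination,
# departing place) from each repeated list item, producing one record per
# cruise. The sample HTML and class names are hypothetical.
from html.parser import HTMLParser

SAMPLE = """
<div class="result">
  <span class="line">Carnival</span>
  <span class="dest">Caribbean</span>
  <span class="from">Miami, FL</span>
</div>
<div class="result">
  <span class="line">Royal Caribbean</span>
  <span class="dest">Bahamas</span>
  <span class="from">Port Canaveral, FL</span>
</div>
"""

class RowParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []    # one dict per cruise
        self.row = None   # record currently being filled
        self.field = None # field name currently being read

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "div" and a.get("class") == "result":
            self.row = {}
        elif tag == "span" and self.row is not None:
            self.field = a.get("class")

    def handle_endtag(self, tag):
        if tag == "div" and self.row is not None:
            self.rows.append(self.row)
            self.row = None
        elif tag == "span":
            self.field = None

    def handle_data(self, data):
        if self.field and data.strip():
            self.row[self.field] = data.strip()

p = RowParser()
p.feed(SAMPLE)
print(p.rows)
```

Each Capture Text command in the Agent corresponds to one field in these records; because the commands live inside the list, they run once for every cruise on the page.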
13. To extract the ‘Ports of Call’ data we will utilize Content Grabber's content transformation method. Refer to Refine Your Data to see how this is done.
14. We also want to capture all the price information in the pricing table, so, as before, we select the first element (View Itinerary) in the pricing table. Content Grabber then selects the corresponding first element throughout the list.
15. Click one more time to display the Content Grabber Message window, then choose the Capture Text option to add the command to the Agent. Rather than changing the default command name now, we will first add all the individual price details to the Agent and then rename them all at once.
16. Repeat steps 14 and 15 for the ‘inside’, ‘outside’, ‘balcony’ and ‘suite’ data elements to add four new commands to the Agent.
17. Now we change the names of the commands to ‘Departing Date’, ‘Inside’, ‘Outside’, ‘Balcony’ and ‘Suite’ by clicking the respective Edit icons, then save.
Agent Explorer showing all Capture Text commands
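At this point the Agent produces one record per cruise with a fixed set of fields. As a sketch of that output schema (the sample values below are invented, and Ports of Call is omitted here because it is extracted later with a content transformation):

```python
# Illustrative record layout for one extracted cruise.
# Field names mirror the Agent's capture commands; values are made-up samples.
from dataclasses import dataclass, asdict

@dataclass
class CruiseRecord:
    cruise_name: str
    cruise_line: str
    destination: str
    departing_from: str
    departing_date: str
    inside: str
    outside: str
    balcony: str
    suite: str

row = CruiseRecord(
    cruise_name="7 Night Caribbean",
    cruise_line="Carnival",
    destination="Caribbean",
    departing_from="Miami, FL",
    departing_date="2021-06-05",
    inside="$399", outside="$499", balcony="$599", suite="$899",
)
print(asdict(row)["cruise_line"])  # Carnival
```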
18. So far we have built the Agent to extract all the cruise information on the first page; next we need it to iterate through all the search result pages. To do this, we use the Follow Pagination command to follow each page. Scroll down the page and select the ‘Next 10’ link, then click the selected element one more time to display the Content Grabber Message window.
Content Grabber Message windows with Follow Pagination option selected
19. Now we choose the Follow Pagination option to add the pagination command to the Agent.
Content Grabber adds the pagination command to the Agent and loads the next page in a second browser tab.
20. When we click the pagination command, we can see all the search list commands nested inside it. This means our agent will now iterate through all the search result pages and extract the same information from each.
Agent Explorer showing the contents of the Pagination command
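The Follow Pagination command's behaviour amounts to a simple loop: extract the current page, then follow the ‘Next 10’ link until there is no next page. A minimal Python sketch, using an invented in-memory set of "pages" in place of live fetches:

```python
# Illustrative pagination loop. The PAGES dict stands in for the site's
# search-result pages; keys, cruise labels and "next" links are hypothetical.
PAGES = {
    "page1": {"cruises": ["A", "B"], "next": "page2"},
    "page2": {"cruises": ["C"], "next": "page3"},
    "page3": {"cruises": ["D"], "next": None},  # last page: no Next link
}

def scrape_all(start):
    results, url = [], start
    while url is not None:
        page = PAGES[url]                # real agent: fetch and parse the page
        results.extend(page["cruises"])  # run the nested Search List commands
        url = page["next"]               # follow the pagination link
    return results

print(scrape_all("page1"))  # ['A', 'B', 'C', 'D']
```

Because the search list commands sit inside the pagination command in the Agent Explorer, they run on every page of results, exactly as in the loop above.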
21. We have now finished building the Agent, so we should save it. To save the Agent, choose File > Save from the Content Grabber menu, enter the Agent Name “cruisedirect”, and click the Save button to commit the changes.
Saving the "cruisedirect" Agent
In the next section, Refine Your Data, we use Content Grabber's Content Transformation method to retrieve the Ports of Call data.