<< Click to Display Table of Contents >> Navigation: »No topics above this level« Extracting Data From Non-HTML Documents |
Websites generally provide most of their content in HTML format, but some websites may also provide content in other formats - such as PDF or Microsoft Word documents. Since Content Grabber can only process HTML documents, it will simply download any non-HTML document. Content Grabber can help you extract text and images from within a PDF or Word document by converting such documents into HTML.
To have Content Grabber convert your non-HTML document, you will need to provide an external document converter. The Content Grabber public website provides a list of open source programs that you can use for this purpose. Please remember that we don't support these tools, and you must comply with the license for any conversion tool.
The design of most file formats, including PDF and Word files, doesn't include ease-of-conversion to HTML. So, the conversion output is considerably more difficult to manage than standard HTML. In many cases, you'll have to select the entire HTML page and then use Regular Expressions to extract the target content.
Content Grabber uses a custom script that makes a call to an external document converter, and you can configure this script to call any type of program. The default script can handle the three document converters currently available for download on the Content Grabber website.
Follow these steps to install the PDF-to-HTML document converter:
1.Download the pdftohtml.zip file from the Content Grabber website:
https://contentgrabber.com/web-scraping-tools
2.Extract the content of the zip file into the default Content Grabber Converters folder, My Documents\Content Grabber\Agents\Shared\Converters. You can also copy the converter into the corresponding Public Documents folder if you need the converter to be available for all users on the computer.
3.The direct path to the document converter should now be:
My Documents\Content Grabber\Agents\Shared\Converters\pdftohtml\pdftohtml.exe
Follow these steps to install the Docx to HTML document converter:
1.Download the docxtohtml.zip file from the Content Grabber website:
https://contentgrabber.com/web-scraping-tools
2.Extract the content of the zip file into the default Content Grabber Converters folder, My Documents\Content Grabber\Agents\Shared\Converters. You can also copy the converter into the corresponding Public Documents folder if you need the converter to be available for all users on the computer.
3.The direct path to the document converter should now be:
My Documents\Content Grabber\Agents\Shared\Converters\pdftohtml\pdftohtml.exe
Follow these steps to install the Excel to HTML document converter:
4.Download the exceltohtml.zip file from the Content Grabber website:
https://contentgrabber.com/web-scraping-tools
5.Extract the content of the zip file into the default Content Grabber Converters folder, My Documents\Content Grabber\Agents\Shared\Converters. You can also copy the converter into the corresponding Public Documents folder if you need the converter to be available for all users on the computer.
6.The direct path to the document converter should now be:
My Documents\Content Grabber\Agents\Shared\Converters\exceltohtml\exceltohtml.exe
You'll need to add the custom script (that calls the document converters) to a Download Document command, and that same command must also perform the download of the document.
The default conversion script looks like this:
using System;
using System.IO;
using Sequentum.ContentGrabber.Api;
public class Script
{
//See help for a definition of ConvertDocumentToHtmlArguments.
public static bool ConvertDocumentToHtml(ConvertDocumentToHtmlArguments args)
{
if(args.DocumentType.Equals("pdf", StringComparison.OrdinalIgnoreCase))
{
args.ScriptUtilities.ExecuteCommandLine(@"Converters\pdftohtml\pdftohtml.exe",
args.DocumentFilePath, args.HtmlFilePath, "-noframes -nodrm");
}
else if(args.DocumentType.Equals("docx", StringComparison.OrdinalIgnoreCase))
{
args.ScriptUtilities.ExecuteCommandLine(@"Converters\docxtohtml\docxtohtml.exe",
args.DocumentFilePath, args.HtmlFilePath, "");
}
else if(args.DocumentType.Equals("xlsx", StringComparison.OrdinalIgnoreCase) || args.DocumentType.Equals("xls", StringComparison.OrdinalIgnoreCase))
{
if(args.IsDebug)
{
args.ScriptUtilities.ExecuteCommandLine(@"Converters\exceltohtml\exceltohtml.exe",
args.DocumentFilePath, args.HtmlFilePath, "-noimages -rows 100");
}
else
{
args.ScriptUtilities.ExecuteCommandLine(@"Converters\exceltohtml\exceltohtml.exe",
args.DocumentFilePath, args.HtmlFilePath, "-noimages");
}
}
else if(args.DocumentType.Equals("csv", StringComparison.OrdinalIgnoreCase) || args.DocumentType.Equals("txt", StringComparison.OrdinalIgnoreCase))
{
if(args.IsDebug)
{
args.ScriptUtilities.ExecuteCommandLine(@"Converters\exceltohtml\exceltohtml.exe",
args.DocumentFilePath, args.HtmlFilePath, "-text -delimiter , -noimages -rows 100");
}
else
{
args.ScriptUtilities.ExecuteCommandLine(@"Converters\exceltohtml\exceltohtml.exe",
args.DocumentFilePath, args.HtmlFilePath, "-text -delimiter , -noimages");
}
}
if(!File.Exists(args.HtmlFilePath))
return false;
return true;
}
}
After converting a document to HTML, that document goes into the agent data folder and you can use a Navigate URL command to open the HTML document. The Download Document command that did the conversion will also store the path to the document, and then the Navigate URL command can use that path to get the file URL to the HTML document.
An agent with a URL command that links to a
converted HTML document
The Navigate URL command uses data from the Download Document command, so the Download Document command must execute first. Also, both the commands must have the same parent command, or the Navigate URL command must be a child command of the command that contains the Download Document command.
You can execute the Navigate URL command in the editor to open the converted HTML document, but you must first execute the Download Document command to make sure the HTML document is available.
A Navigate URL command using data captured by a
Download Document command