Convert Document to HTML Scripts

A Convert Document to HTML script converts a downloaded document into an HTML page, so Content Grabber can extract data from the document the same way it does from any other HTML page.

 

Please see the topic Extracting Data From Non-HTML Documents for more information.

 

A Convert Document to HTML script can be added to a Download Document command by selecting the Convert to HTML configuration tab and setting the option Convert to HTML:

 

[Image: convertDocument]

 

The following example is the default Convert Document to HTML script. This script checks the type of the downloaded document and uses the appropriate document converter to convert the document to an HTML page:

 

using System;
using System.IO;
using Sequentum.ContentGrabber.Api;

public class Script
{
    public static bool ConvertDocumentToHtml(ConvertDocumentToHtmlArguments args)
    {
        // Pick the external converter that matches the downloaded document type.
        if(args.DocumentType=="pdf")
            ScriptUtils.ExecuteCommandLine(@"Converters\pdftohtml\pdftohtml.exe",
                args.DocumentFilePath, args.HtmlFilePath, "-noframes");
        else if(args.DocumentType=="docx")
            ScriptUtils.ExecuteCommandLine(@"Converters\docxtohtml\docxtohtml.exe",
                args.DocumentFilePath, args.HtmlFilePath, "");

        // The conversion is treated as failed if no HTML file exists at the expected path.
        if(!File.Exists(args.HtmlFilePath))
            return false;
        return true;
    }
}

 

The function should return true if the conversion succeeds, or false if it fails.
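To illustrate how the script might be extended to another document type, here is a minimal sketch that converts plain text files without an external converter. The document type value "txt" is an assumption made for illustration (only pdf is confirmed above as a type value), and the sketch assumes that System.Net.WebUtility is available to the script environment.

```csharp
using System.IO;
using System.Net;
using Sequentum.ContentGrabber.Api;

public class Script
{
    public static bool ConvertDocumentToHtml(ConvertDocumentToHtmlArguments args)
    {
        // "txt" is an assumed DocumentType value, used here for illustration only.
        if (args.DocumentType == "txt")
        {
            // HTML-encode the text so characters such as < and & are kept as literal text.
            string text = File.ReadAllText(args.DocumentFilePath);
            string html = "<html><body><pre>" + WebUtility.HtmlEncode(text) + "</pre></body></html>";
            File.WriteAllText(args.HtmlFilePath, html);
        }
        else if (args.DocumentType == "pdf")
        {
            // Same converter call as in the default script.
            ScriptUtils.ExecuteCommandLine(@"Converters\pdftohtml\pdftohtml.exe",
                args.DocumentFilePath, args.HtmlFilePath, "-noframes");
        }

        // Report success only if an HTML file exists at the expected path.
        return File.Exists(args.HtmlFilePath);
    }
}
```

Wrapping the text in a pre element keeps the original line breaks, which can make it easier to select lines of text in the resulting HTML page.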

An instance of the ConvertDocumentToHtmlArguments class is provided by Content Grabber and has the following functions and properties:

| Property or Function | Description |
|---|---|
| string DocumentFilePath | The path of the document that needs to be converted to HTML. |
| string DocumentType | The type of document that needs to be converted to HTML. For example, if the document is a PDF document, the document type will be pdf. |
| string HtmlFilePath | The file path the script should use for the converted HTML file. |
| Agent Agent | The current agent. |
| ScriptUtils ScriptUtilities | A script utility class with helper methods. See Script Utilities for more information. |
| Command Command | The current agent command being executed. |
| IContainer ParentContainer | The parent container command of the current command. |
| IConnection DatabaseConnection | The current internal database connection used by the agent. This connection is already open and should not be closed by your script. |
| IHtmlNode HtmlNode | The extracted HTML node. |
| IInternalDataRow DataRow | The current internal data row containing the data that has been extracted so far in the current container command. |
| bool IsDebug | True if the agent is running in debug mode. |
| bool IsSchemaOnly | If true, only the data schema is required, so you can optimize processing by only returning the data schema with no data. |
| IInputData InputDataCache | All input data available to the current command. |
| void WriteDebug(string debugMessage, DebugMessageType messageType = DebugMessageType.Information) | Writes log information to the agent log. This method has no effect if agent logging is disabled, or if called during design time. |
| void WriteDebug(string debugMessage, bool showMessageInDesignMode, DebugMessageType messageType = DebugMessageType.Information) | Writes log information to the agent log. This method has no effect if agent logging is disabled, or if called during design time. |
| void Notify(bool alwaysNotify) | Triggers notification at the end of an agent run. If alwaysNotify is set to false, this method only triggers a notification if the agent has been configured to send notifications on critical errors. |
| void Notify(string message, bool alwaysNotify) | Triggers notification at the end of an agent run, and adds the message to the notification email. If alwaysNotify is set to false, this method only triggers a notification if the agent has been configured to send notifications on critical errors. |
| GlobalDataDictionary GlobalData | Global data dictionary that can be used to store data that needs to be available in all scripts and after agent restarts. Input Parameters are also stored in this dictionary. |
| IConnection GetDatabaseConnection(string connectionName) | Returns the specified database connection. The database connection must have been previously defined for the agent or be a shared connection for all agents on the computer. Your script is responsible for opening and closing the connection by calling the OpenDatabase and CloseDatabase methods. |
| IInputDataRow GetInputData() | If the current command is a data provider, the data for that command is returned. Otherwise this function searches the command's parents and returns the first found input data. |
| IInputDataRow GetInputData(Command command) | If the specified command is a data provider, the data for that command is returned. Otherwise this function searches the command's parents and returns the first found input data. |
| IInputDataRow GetInputData(string commandName) | If the specified command is a data provider, the data for that command is returned. Otherwise this function searches the command's parents and returns the first found input data. |
| IInputDataRow GetInputData(Guid commandId) | If the specified command is a data provider, the data for that command is returned. Otherwise the function throws an error. |
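As a rough sketch of how the logging and notification members listed above might be combined in a conversion script, the following example logs each conversion and requests a notification when no HTML file is produced. The message texts are illustrative, and only members described on this page are used.

```csharp
using System.IO;
using Sequentum.ContentGrabber.Api;

public class Script
{
    public static bool ConvertDocumentToHtml(ConvertDocumentToHtmlArguments args)
    {
        // Written to the agent log; has no effect if agent logging is disabled.
        args.WriteDebug("Converting " + args.DocumentType + " document: " + args.DocumentFilePath);

        if (args.DocumentType == "pdf")
            ScriptUtils.ExecuteCommandLine(@"Converters\pdftohtml\pdftohtml.exe",
                args.DocumentFilePath, args.HtmlFilePath, "-noframes");

        if (!File.Exists(args.HtmlFilePath))
        {
            args.WriteDebug("No HTML file was produced at " + args.HtmlFilePath);

            // With alwaysNotify set to false, a notification is only triggered if the agent
            // is configured to send notifications on critical errors.
            args.Notify("Document conversion failed: " + args.DocumentFilePath, false);
            return false;
        }

        return true;
    }
}
```

Passing false for alwaysNotify leaves the decision to the agent's own notification settings, which may be preferable when the same script is shared between agents.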