Script Languages

The Content Grabber scripting engine supports C#, VB.NET and Regular Expressions. C# and VB.NET can be used for all types of scripts, but Regular Expressions can only be used by content transformation scripts.

Regular Expressions

This manual does not explain regular expression syntax. Please visit the following website for more information about regular expressions:

Regular Expressions Reference Guide

Regular expression scripts in Content Grabber can include any number of regex match and replace operations. Each regex operation must be specified on two lines. The first line must contain the regex pattern and the second line must contain the operation. The following operations are supported.

Operation	Description
return	Returns the first match in the original content or a selected group within the match. The returned match can be combined with static text.
return all	Returns all matches or the specified group within all matches.
return table	Returns all matches in a data table. Each captured group becomes a column in the data table. Named capture groups can be used to name the data columns.
return if match otherwise	Returns one value if a match is found and another value if no match is found. For example, True could be returned if a match is found and False if not. Example: return if match "True" otherwise "False"
replace with	Replaces all matches, or a selected group within the first match, in the original content and then returns the modified content.

A group within a match must be specified by its group number and cannot be specified by a group name, except when using the return table operation. The group number 0 specifies the entire match, and the number 1 specifies the first group in the match. A group number must be preceded by the character $, so, for example, $1 specifies the first group in a match. Use two $ characters ($$) to escape the $ character in an operation.
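As a sketch of the two-line layout and the group syntax described above, the script below pairs one pattern with one operation. The pattern and the sample input are illustrative assumptions, not taken from a real agent.

```text
Price:\s*(\d+\.\d\d)
return $$$1
```

The first line is the pattern and the second line is the operation. Here $$ stands for a literal $ character and $1 for the first group, so for the input Price: 19.99 the script would be expected to return $19.99.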

If a regex script contains more than one regex operation, each operation works on the output of the preceding operation. All regex operations are case insensitive and line breaks are ignored. The following special operations can each be specified on a single line:

• strip_html removes all HTML tags from the content.

• url_decode decodes an encoded URL.

• html_decode decodes encoded HTML.

• unescape_string unescapes content that has been string encoded. For example, \" will be converted to " and \\ to \.

• trim removes line breaks and white space from the beginning and end of the content.

• line_breaks converts some HTML tags such as <P>, <BR> and <LI> into standard Windows line breaks.

• to_lower converts text to lower case.

• to_upper converts text to upper case.

• capitalize_words capitalizes all words in the text.

• aggregate [separator] joins all strings in a list returned by return all into a single string, separated by the specified separator.

• insert_data inserts data into data templates defined in the content.
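To illustrate how a regex operation can be followed by single-line special operations, here is a small sketch. The pattern is an illustrative assumption; only the operation names come from the list above.

```text
<h1>(.*?)</h1>
return $1
html_decode
trim
```

The first two lines extract the text inside the first <h1> element. Each following line then works on the output of the step before it: html_decode decodes any encoded HTML in the extracted text, and trim removes leading and trailing white space and line breaks.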

The syntax {$content_name} can be used to reference extracted data, input data, global data or input parameters. For example, if you had a capture command named product_id, you could construct the following regular expression to extract all text between the product ID and the first white space:

{$product_id}(.*?)\s

The specified name may reference a capture command name, a data provider command name, an input parameter name, or a global data name. The script will first try to find a matching capture command, then a matching data provider command, then an input parameter name, and lastly a global data name. If the data from one source does not exist or is empty, the script will proceed to the next data source.
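The lookup order described above can be sketched in plain C#. This is not Content Grabber's implementation: the ContentNameLookup class, the Resolve method, and the use of dictionaries to stand in for the four data sources are all hypothetical, and returning an empty string when nothing is found is an assumption.

```csharp
using System;
using System.Collections.Generic;

public static class ContentNameLookup
{
    // Hypothetical helper: returns the first non-empty value stored under
    // "name", checking the given sources in the order they are passed in.
    public static string Resolve(string name, params IDictionary<string, string>[] sourcesInPriorityOrder)
    {
        foreach (IDictionary<string, string> source in sourcesInPriorityOrder)
        {
            string value;
            // Skip sources where the name is missing or its value is empty.
            if (source.TryGetValue(name, out value) && !string.IsNullOrEmpty(value))
            {
                return value;
            }
        }

        // Assumption: nothing found in any source yields an empty string.
        return string.Empty;
    }
}
```

Passing the sources in the order capture commands, data provider commands, input parameters, global data mirrors the documented priority. For example, if product_id is empty in the capture command dictionary, absent from the data provider dictionary, and set to P-42 in the input parameter dictionary, Resolve returns P-42 even when the global data dictionary also defines product_id.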

Examples

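Here are a number of examples. Each row shows the regex pattern followed by its operation on a single line; in an actual script they go on separate lines, as described above.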

Regular Expression	Description
.* return	Returns the entire match, so everything in this case.
A(.*?)B return $1	Returns the group 1 match, so everything between A and B.
(A).*?(B) return $1$2	Returns group 1 and 2 matches, so AB in this case.
A(.*?)B replace with	Replaces every instance of everything between A and B (including A and B) with nothing.
A(.*?)B replace with some new text	Replaces every instance of everything between A and B (including A and B) with "some new text".
A(.*?)B replace $1 with some new text	Replaces the first instance of everything between A and B (excluding A and B) with "some new text".
A(.*?)B replace with $1	Replaces every instance of everything between A and B (including A and B) with the text between A and B, so in effect it removes A and B.
A(.*?)B return $1 C(.*?)D replace with	First extracts everything between A and B, and then, within that result, replaces every instance of everything between C and D (including C and D) with nothing.
^(.*) replace with A$1B	Inserts A at the beginning of the string and B at the end of the string.
<br> replace with \r\n	Replaces all <BR> HTML tags with standard Windows line breaks.
A(.*?)B return $1 url_decode	Returns the group 1 match, so everything between A and B, and then URL-decodes the result.
A(.*?)B return C$1D	Returns the group 1 match, but adds C to the beginning and D to the end before returning the match.
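For readers more familiar with .NET, the behaviour of two of the examples above can be approximated in C#. This is only a sketch. It assumes that the patterns follow .NET regex syntax and that "case insensitive and line breaks are ignored" corresponds to RegexOptions.IgnoreCase and RegexOptions.Singleline. The class and method names are hypothetical, and returning an empty string when there is no match is an assumption rather than documented Content Grabber behaviour.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexExamples
{
    // Assumed equivalents of "case insensitive" and "line breaks are ignored".
    private const RegexOptions Options = RegexOptions.IgnoreCase | RegexOptions.Singleline;

    // Approximates the script:  A(.*?)B  /  return $1
    // Returns group 1 of the first match, or an empty string if none.
    public static string ReturnGroup1(string content)
    {
        Match match = Regex.Match(content, "A(.*?)B", Options);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }

    // Approximates the script:  A(.*?)B  /  replace with $1
    // Replaces every match with its group 1, which removes A and B.
    public static string ReplaceWithGroup1(string content)
    {
        return Regex.Replace(content, "A(.*?)B", "$1", Options);
    }
}
```

With the input xxA123Byy, ReturnGroup1 yields 123 and ReplaceWithGroup1 yields xx123yy.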

C# and VB.NET

This manual does not explain C# and VB.NET syntax. Please visit an online reference guide for more information about these languages.

C# and VB.NET scripts must have one class named Script with a static method that Content Grabber executes. The signature of this method depends on the type of script. The following example is for a Data Input script.

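using System;
using System.Data;
using Sequentum.ContentGrabber.Api;

public class Script
{
    // Entry point that Content Grabber calls for a Data Input script.
    public static DataTable ProvideData(DataProviderArguments args)
    {
        // Load the input rows from a CSV file in the default input folder.
        DataTable data = args.ScriptUtilities.LoadCsvFileFromDefaultInputFolder("inputData.csv");
        return data;
    }
}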

A script can have more than one function in the Script class, and can also use functionality from external .NET libraries. All external .NET libraries must be added to the agent's assembly references. See the topic Assembly References for more information.
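A minimal sketch of a Script class with more than one function might look like the following. It reuses the Data Input signature and the LoadCsvFileFromDefaultInputFolder call from the example above. The RemoveEmptyRows helper is hypothetical and uses only standard System.Data members. The code compiles only inside Content Grabber or against its API assembly.

```csharp
using System;
using System.Data;
using Sequentum.ContentGrabber.Api;

public class Script
{
    // Entry point that Content Grabber calls for a Data Input script.
    public static DataTable ProvideData(DataProviderArguments args)
    {
        DataTable data = args.ScriptUtilities.LoadCsvFileFromDefaultInputFolder("inputData.csv");
        RemoveEmptyRows(data);
        return data;
    }

    // Hypothetical helper: deletes rows in which every column is null or blank.
    private static void RemoveEmptyRows(DataTable table)
    {
        // Iterate backwards so that removing a row does not shift unvisited rows.
        for (int i = table.Rows.Count - 1; i >= 0; i--)
        {
            bool isEmpty = true;
            foreach (object value in table.Rows[i].ItemArray)
            {
                if (value != null && value != DBNull.Value && value.ToString().Trim().Length > 0)
                {
                    isEmpty = false;
                    break;
                }
            }

            if (isEmpty)
            {
                table.Rows.RemoveAt(i);
            }
        }
    }
}
```

Keeping the clean-up step in its own function leaves the entry point short and makes the helper easy to reuse if the agent loads more than one input file.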