CAPTCHA Blocking

A website can implement CAPTCHA blocking by requiring the user to submit a web form before gaining access to restricted areas of the site. The web form is usually quite simple, consisting of an image element and a text box element. The image displays a sequence of characters, which the user must enter into the text box exactly as shown. A human user can read the text in the CAPTCHA image, but a web-scraping agent requires character recognition software to discern the characters in the image.

 

Figure: A typical registration form with CAPTCHA blocking

 

Content Grabber supports both manual and automatic data extraction from websites that implement CAPTCHA blocking. Automatic data extraction requires an account with a third-party CAPTCHA recognition service, which typically charges a small fee for each CAPTCHA image it processes. Manual data extraction is free, but requires you to decode CAPTCHA images yourself while the data extraction agent is running.

 

Manual CAPTCHA Configuration

 

You have two options when configuring manual CAPTCHA. The easiest one to configure is the option we describe below. The other option uses the same approach as automatic CAPTCHA configuration, but instead of calling a script to resolve the CAPTCHA, the agent displays a window where the user resolves the CAPTCHA manually.

 

Manual CAPTCHA processing is easy to configure, but requires you to manually decode CAPTCHA images while the agent is running. The agent will pause and display the browser window, where a user can view the CAPTCHA image and enter the CAPTCHA text in the text box. You can configure the agent to automatically submit the web form after the user has entered the CAPTCHA text, or you can rely on the user to submit the form while the agent is paused.

 

If CAPTCHA blocking is part of a larger registration form, you can process the CAPTCHA part manually and let the agent process the rest of the form automatically. In this case you should let the agent submit the form automatically rather than relying on the user to submit the form while the agent is paused.

 

Follow these steps to pause an agent when a CAPTCHA image is displayed:

 

1. Add an Execute Script command to your agent.
2. Select the CAPTCHA image element in the web browser. This sets the command's web selection.
3. Select the default script type Pause Agent.
4. Select the default script condition If Selection Exists.
5. Save the command.

 

This command will pause the agent when the CAPTCHA image element exists on the web page, and allow a user to enter the CAPTCHA text.

 

You must add this command to all locations in the agent where CAPTCHA blocking could be encountered.

 

Important: Manual CAPTCHA processing relies on a human user to decode the CAPTCHA image, so an agent using manual CAPTCHA configuration cannot be run from the scheduler or the API, or in any other fully automated way.

 

Automatic CAPTCHA Configuration

 

Automatic CAPTCHA processing requires an account with a third-party CAPTCHA recognition service. The recognition service must provide a .NET API, and you must add an OCR script that uses this API to call the service. See the CAPTCHA OCR Scripts section below for two examples of CAPTCHA recognition services.

 

Follow these steps to configure an agent for automatic CAPTCHA processing:

 

1. Add a new Group Commands command. This group of commands will handle CAPTCHA.
2. Add an Execute Script command to the group. This command will skip CAPTCHA processing if the CAPTCHA image doesn't exist on the web page.
    i. Select the CAPTCHA image in the web browser. This sets the command's web selection.
    ii. Select the default script type Exit Command.
    iii. Select the default command Parent Command.
    iv. Select the default condition If Selection Missing.
3. Add a Download Image command to the group. This command will download the CAPTCHA image and use an OCR script to decode the image into plain text.
    i. Select the CAPTCHA image in the web browser. This sets the command's web selection.
    ii. Open the OCR tab and check the option Convert image to text.
    iii. Add an OCR script. See below for script examples. If you want to manually resolve the CAPTCHA, you can check the option Convert image manually, in which case you should not specify a script.
4. Add a Set Form Field command to the group. This command will use the converted image text to set the CAPTCHA text box.
    i. Select the CAPTCHA form field on the web page. This sets the command's web selection.
    ii. Clear the command option Use default input.
    iii. Set the data provider to Captured Data and select the CAPTCHA image command from step 3.
5. Add a Navigate Link command to the group. This command will submit the CAPTCHA form.
    i. Select the CAPTCHA form submit button. This sets the command's web selection.
6. Add an Execute Script command to the group. This command will retry CAPTCHA processing if the CAPTCHA image still exists on the web page. If the image still exists, we assume the CAPTCHA recognition service decoded it incorrectly, and we try again.
    i. Select the CAPTCHA image in the web browser. This sets the command's web selection.
    ii. Select the default script type Retry Command.
    iii. Select the default command Parent Command.
    iv. Select the default condition If Selection Exists.

 

 

Figure: Automatic CAPTCHA configuration

 

 

The command library includes the group of commands listed above. Select the Automatic CAPTCHA command from the library and add it to your agent. After you have added the group command from the library, you need to set the web selection for all the commands that require a web selection.

 

You must add this group of commands to all locations in the agent where CAPTCHA blocking could be encountered.
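
The control flow of the automatic CAPTCHA command group can be sketched in plain C#. This is only an illustration of the skip, decode, submit, and retry logic, not Content Grabber code: the class name, the delegate parameters, and the byte[] image type are hypothetical stand-ins for the commands in the group, and the maxAttempts limit is an assumption added so the sketch always terminates.

```csharp
using System;

// Hypothetical sketch; none of these names are part of the Content Grabber API.
public static class CaptchaGroupSketch
{
    // Returns true once the CAPTCHA image is no longer on the page.
    public static bool Run(
        Func<bool> captchaImageExists,      // web selection check on the CAPTCHA image
        Func<byte[]> downloadCaptchaImage,  // stands in for the Download Image command
        Func<byte[], string> decodeImage,   // stands in for the OCR script
        Action<string> setCaptchaField,     // stands in for the Set Form Field command
        Action submitForm,                  // stands in for the Navigate Link command
        int maxAttempts)                    // assumption: cap on retries
    {
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            // Exit the group if the CAPTCHA image is missing.
            if (!captchaImageExists())
            {
                return true;
            }

            // Download the image and decode it to text.
            string text = decodeImage(downloadCaptchaImage());

            // Enter the decoded text and submit the form.
            setCaptchaField(text);
            submitForm();

            // Looping back mirrors the retry of the parent command
            // while the CAPTCHA image still exists.
        }

        return !captchaImageExists();
    }
}
```

In the agent itself, the same flow is expressed through the command group rather than a loop, so no script like this needs to be added.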

 

CAPTCHA OCR Scripts

 

Content Grabber includes the API and standard OCR scripts to call the following CAPTCHA recognition services.

 

http://www.deathbycaptcha.com

 

http://bypasscaptcha.com

 

At the time of writing, the Death by CAPTCHA service charges US$6.95 for 5000 CAPTCHAs and Bypass CAPTCHA charges US$34 for 5000 CAPTCHAs. We are not affiliated with these companies in any way and don't charge any additional fees for these services.

 

The following OCR script uses the Death by CAPTCHA service to decode CAPTCHA images.

 

public static string ConvertImageToText(ConvertImageToTextArguments args)
{
    // Send the CAPTCHA image to Death by CAPTCHA and return the decoded text.
    string captcha = DeathByCaptchaService.DecodeCaptcha(args.Image, "login", "password");
    return captcha;
}

 

The login and password in the script above are provided by Death by CAPTCHA.

 

The following OCR script uses the Bypass CAPTCHA service to decode CAPTCHA images.

 

public static string ConvertImageToText(ConvertImageToTextArguments args)
{
    // Send the CAPTCHA image to Bypass CAPTCHA and return the decoded text.
    string captcha = BypassCaptchaService.DecodeCaptcha(args.Image, "key");
    return captcha;
}

 

The key in the script above is provided by Bypass CAPTCHA.
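
If you have accounts with more than one recognition service, an OCR script could try them in order. The helper below is a minimal sketch of that idea and is not part of Content Grabber: the class and method names are hypothetical, and it assumes a service signals failure either by throwing an exception or by returning empty text, which this documentation does not confirm.

```csharp
using System;

// Hypothetical helper; not part of the Content Grabber API.
public static class OcrFallbackSketch
{
    // Tries each decoder in order and returns the first non-empty result,
    // trimmed of surrounding whitespace. Returns an empty string if all fail.
    public static string DecodeWithFallback<TImage>(
        TImage image,
        params Func<TImage, string>[] decoders)
    {
        foreach (Func<TImage, string> decode in decoders)
        {
            try
            {
                string text = decode(image);
                if (!string.IsNullOrWhiteSpace(text))
                {
                    return text.Trim();
                }
            }
            catch (Exception)
            {
                // Assumption: a failed service call may throw.
                // Ignore the error and move on to the next decoder.
            }
        }

        return string.Empty;
    }
}
```

Inside a ConvertImageToText script, the decoders could be lambdas that wrap the DeathByCaptchaService.DecodeCaptcha and BypassCaptchaService.DecodeCaptcha calls shown above. The image type is left generic because the type of args.Image is not stated here.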

 

Troubleshooting

If CAPTCHA processing fails even though the correct CAPTCHA text was entered, the following issue may be the cause.

 

Content Grabber cannot get the CAPTCHA image directly from the web browser, so it downloads the image a second time. The web server may disallow this second download, especially if you are using a proxy rotation service where a new IP address may be assigned for the image download. To overcome this problem, use a Download Screenshot command instead of a Download Image command, in which case a second image download is not required.