Performance Sessions

Sessions allow you to run multiple instances of the same agent at the same time. This can be used to split up large web scraping tasks and have multiple instances of an agent working on the task.

 

Content Grabber splits up a large task by dividing list entries into subsets, and each instance of the agent will then work on one of those subsets. For example, if you are processing a long list of start URLs, Content Grabber could divide the list into two and have one instance of the agent go through the first half of the list and a second instance of the agent go through the second half of the list.

 

Agents already use multithreading internally to split up work in a similar fashion to performance sessions, but sometimes it's faster to have multiple processes working on a task rather than multiple threads, especially if you run the processes on multiple computers. A single instance of an agent processing a website using multithreading needs to wait for threads to catch up at certain points. For example, when processing pagination with multiple threads going through a list of links on each page, some threads may finish before others, but they must wait for all threads to finish before the agent can move to the next page in the pagination. Multiple instances of the same agent run completely independently, so one instance never has to wait for other instances to catch up.

 

Running an Agent in a Session

To run an agent in a session, you must specify a session ID when you run the agent, and the agent must be configured to support sessions. To configure an agent to support performance sessions, set the agent option Support Sessions to Performance Sessions on the Agent Settings tab.

 

When using Performance Sessions, the session ID must be in a special format that dictates how work is divided between sessions. By default, the input list associated with the Agent command (start command) is divided, but you can choose any other list command in the agent instead by setting the option Process in Sessions on that list command. You can only set this option on one command in an agent.

 

The session ID specifies how many sessions will work on the input list, and which subset of list entries the current instance of the agent should work on. The session ID must be in the following format:

 

[Subset to Process]/[Total Number of Sessions]

 

For example, if you have an agent that processes a list of 10 start URLs, and you want 5 instances of the agent to each process 2 URLs, then the session ID "3/5" starts an instance of the agent that processes URLs 5 and 6.
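
The arithmetic behind this example can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and it assumes the list is divided into contiguous blocks of near-equal size. The documentation does not say how Content Grabber distributes entries when the list does not divide evenly, so the remainder handling here is an assumption.

```python
def subset_for_session(session: int, total_sessions: int, item_count: int) -> list[int]:
    """Return the 1-based list positions one session would process.

    Assumes contiguous, near-equal blocks; Content Grabber's actual
    handling of uneven lists may differ.
    """
    if not 1 <= session <= total_sessions:
        raise ValueError("session must be between 1 and total_sessions")
    # Integer arithmetic spreads any remainder across the sessions.
    start = (session - 1) * item_count // total_sessions
    end = session * item_count // total_sessions
    return list(range(start + 1, end + 1))


# Session ID "3/5" on a list of 10 start URLs.
print(subset_for_session(3, 5, 10))  # [5, 6]
```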

 

You can start multiple sessions at once by specifying a comma-separated list of numbers, or a range of numbers. Here are a few examples:

 

"1-10/10" starts all 10 sessions.

"1,3-6/10" starts sessions 1 and 3 to 6.

"1,4,5/10" starts sessions 1, 4 and 5.
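
To make the format concrete, the following Python sketch expands a session ID into the individual session numbers it refers to. The function name is hypothetical and the parsing rules are inferred only from the examples above. It is not Content Grabber's own parser, which may accept or reject other inputs.

```python
def parse_session_id(session_id: str) -> tuple[list[int], int]:
    """Expand a session ID such as "1,3-6/10" into (session numbers, total)."""
    subsets, total_text = session_id.split("/")
    total = int(total_text)
    sessions: list[int] = []
    for part in subsets.split(","):
        if "-" in part:
            # A range such as "3-6" includes both end points.
            first, last = (int(n) for n in part.split("-"))
            sessions.extend(range(first, last + 1))
        else:
            sessions.append(int(part))
    for number in sessions:
        if not 1 <= number <= total:
            raise ValueError(f"session {number} is outside 1..{total}")
    return sessions, total


print(parse_session_id("1,3-6/10"))  # ([1, 3, 4, 5, 6], 10)
```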

 

A session ID can be specified when running an agent from the command line by using the command-line option session_id. The following command line runs an agent named sequentum with the session ID "2/10".

 

RunAgent.exe "sequentum" session_id "2/10"
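
If you want to launch every session as a separate process from a script, one possible approach is sketched below in Python. The helper names are hypothetical. The sketch assumes that RunAgent.exe can be found on the system path and that passing the arguments as a list is equivalent to the quoted command line shown above.

```python
import subprocess


def session_command(agent_name: str, session: int, total_sessions: int) -> list[str]:
    """Build the argument list for one session, mirroring the command line above."""
    return ["RunAgent.exe", agent_name, "session_id", f"{session}/{total_sessions}"]


def start_all_sessions(agent_name: str, total_sessions: int) -> list[subprocess.Popen]:
    """Start one independent process per session (assumes RunAgent.exe is on the path)."""
    return [
        subprocess.Popen(session_command(agent_name, session, total_sessions))
        for session in range(1, total_sessions + 1)
    ]


# Show the commands without running them.
for session in range(1, 4):
    print(session_command("sequentum", session, 3))
```

Calling start_all_sessions("sequentum", 10) would then start ten processes that run independently of each other, in the same way as running the command line ten times with the session IDs "1/10" to "10/10".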

 

You can also specify a session ID when you run an agent from the Content Grabber editor. Open the Run window and enter a session ID, or select an existing session ID from the drop-down box, and then press the Get button to make that session active. Note: The session panel is only visible if the agent supports sessions.

 

Run an agent with a session from the Content Grabber editor.

 

 

You can select All Sessions from the session dropdown list and press the Get button to open a window that displays status information for all agent sessions.

 

You can delete all sessions by selecting All Sessions from the session dropdown list and then pressing the Delete button. You can also delete a range of sessions by specifying a session range.

 

Session Data Cleanup

Normally, data generated by sessions is cleaned up when the session expires. This is because sessions are normally new sessions with a new session ID, so to avoid keeping old data around forever, Content Grabber removes session data periodically unless you specifically tell it not to. Performance sessions are different, since they are not always new sessions with new session IDs, so by default Content Grabber does not remove data generated by performance sessions.

 

You can manually delete one or more sessions from the Run window in the Content Grabber editor.

 

When you delete a session, Content Grabber will only clean up externally exported session data if the agent is configured to export data to a database. If you are using an Export Script or if you are exporting to a file format, then you are responsible for any cleanup of exported session data.

 

You may not always want to remove externally exported session data when you delete a session. To prevent externally exported session data from being removed, set the agent option Cleanup External Session Data to false. This option can be found in the Sessions section on the advanced options tab.