What should I know about robots, runs, and executions?

What are robots?

The most fundamental part of dexi.io, a robot essentially describes what to do with a Web page/site and how to do it. dexi.io offers two types of robots.

Extractor

The extractor can extract data from any site. It is fully HTML5-compliant, highly capable, and has the same feature set as a desktop Web browser. It handles HTML, CSS, JavaScript, downloads, Web sockets, Canvas, forms, log-ins, and much more. The drawback of this rich feature support is that features a given site does not need still tend to add processing time, making the extractor slower than the crawler.

Crawler

The crawler is a much simpler robot. Given a URL to begin with, the crawler automatically finds all outgoing links from that page within the same domain and traverses these pages, repeating this with each discovered page. This is basically the same technique that Google, Yahoo, and Bing use to index the Web, but our crawler is confined to a single domain per robot. The crawler doesn't support CSS, JavaScript, or any other special elements. This makes it highly concurrent and very fast. But without advanced feature support, the crawler is limited in which pages it can interact with.
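The traversal described above can be sketched as a breadth-first search that only follows links on the starting domain. This is a minimal illustration, not dexi.io's actual implementation: the function name `crawl_same_domain`, the `get_links` callable, and the toy `SITE` dictionary are all assumptions made for the example, and no real HTTP requests are made.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl_same_domain(start_url, get_links):
    """Breadth-first traversal of pages on the start URL's domain.

    get_links is a hypothetical callable: given a URL, it returns the
    href values found on that page (fetching and parsing are out of scope).
    """
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for href in get_links(url):
            link = urljoin(url, href)  # resolve relative links against the page
            # Stay on the starting domain and never revisit a page.
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Toy site standing in for real pages (hypothetical data).
SITE = {
    "https://example.com/": ["/a", "/b", "https://other.example.org/x"],
    "https://example.com/a": ["/b", "/c"],
    "https://example.com/b": ["/"],
    "https://example.com/c": [],
}

visited = crawl_same_domain("https://example.com/", lambda u: SITE.get(u, []))
print(visited)
```

The link to `other.example.org` is skipped because it is outside the starting domain, which mirrors the single-domain restriction described above.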

What are runs?

To execute a robot, you need at least one run. A run is a configuration that describes how you want to execute the robot; it is not an execution itself.

You can have an unlimited number of executions of a single run. You can also have an unlimited number of runs per robot, but for the vast majority of robots, you only need one or two. A run configuration includes concurrency, scheduling, integrations, and inputs.
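One way to picture the difference between a run and an execution is as two separate records: the run holds the configuration, and each execution is created from it. The sketch below is only a mental model under that assumption. The class names, field names, and the `"daily"` schedule value are hypothetical and do not reflect dexi.io's API or data format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Execution:
    """One concrete execution of a run (hypothetical model)."""
    run_name: str
    number: int
    results: List[dict] = field(default_factory=list)


@dataclass
class Run:
    """A reusable configuration for executing a robot (hypothetical model)."""
    name: str
    robot: str
    concurrency: int = 1
    schedule: Optional[str] = None  # e.g. "daily"; the format is an assumption
    integrations: List[str] = field(default_factory=list)
    inputs: List[dict] = field(default_factory=list)
    executions: List[Execution] = field(default_factory=list)

    def execute(self) -> Execution:
        # Each call creates a new execution; the configuration is unchanged.
        execution = Execution(run_name=self.name, number=len(self.executions) + 1)
        # Placeholder: one empty result per input, standing in for real extraction.
        execution.results = [{"input": item, "data": None} for item in self.inputs]
        self.executions.append(execution)
        return execution


daily = Run(
    name="daily-check",
    robot="price-extractor",
    schedule="daily",
    inputs=[{"query": "laptop"}, {"query": "phone"}],
)
first = daily.execute()
second = daily.execute()
print(first.number, second.number, len(daily.executions))
```

The same `Run` object is executed twice here, producing two independent `Execution` records, which is the relationship the text describes.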

Integrations

The Integrations tab allows you to select which of your configured integrations this particular run should use. For every selected integration, dexi.io will upload all available formats to that integration upon successful execution.

Inputs

Inputs are especially important to understand, as inputs are often used to pass search criteria, log-in credentials, or other information to the Web site. If your robot requires input, you must add inputs to the run or the robot will fail. Adding an input looks like this in the extractor editor, in the Inputs tab:

Here is what the step looks like:

On the configuration page, add inputs using the Inputs tab:

You will then get a series of results for each individual input:

To import your input values, download the CSV template that dexi.io automatically creates based on your robot's input fields and fill in your values. Save the CSV file and upload it using the Import CSV button to import the values.
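If you have many input values, you may prefer to fill in the template with a script instead of by hand. The sketch below assumes the template is a plain CSV with one column per input field and one row per set of inputs. The exact layout of the dexi.io template is not documented here, so treat the column names `search_term` and `max_price` as hypothetical and check them against the template you actually download.

```python
import csv
import io

# Hypothetical input field names; use the header row from your own template.
input_fields = ["search_term", "max_price"]

# Values to import: one dictionary per set of inputs (example data).
values = [
    {"search_term": "laptop", "max_price": "900"},
    {"search_term": "phone", "max_price": "400"},
]

# Write header plus rows to an in-memory buffer; replace io.StringIO with
# open("inputs.csv", "w", newline="") to produce a file you can upload.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=input_fields)
writer.writeheader()
writer.writerows(values)

# Read the CSV back to confirm that every row has a value for each field.
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
print(len(rows), rows[0]["search_term"])
```

Reading the file back before uploading is a cheap check that no column is missing or misnamed.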

Watching runs and robots

To be notified via e-mail or push notification when an execution succeeds or fails, you can Watch a run. To start watching, simply click the Not watching button when editing a run. That'll bring up a drop-down menu where you can specify what you want to watch.

To enable push notifications to your smartphone or tablet, you must connect with Pushover.net.

Monitoring your robots

If you want to monitor that your robots are in good condition, you can set up smaller runs that execute daily and then use watching to alert you if something goes wrong. This will provide you with an early warning system to keep your robots running smoothly.

What are executions?

An execution is what you get when you execute a run.

The execution contains the results, which you'll see as soon as the robot extracts them, and also holds information about how long it took, how much traffic it used, and other relevant statistics.

On the Results tab you'll find all the results of the execution and whether each one succeeded.

For each result you'll see at least one screenshot, which is the screenshot of the last page of the execution. If your robot has 0 or 1 inputs, all the screenshots will be the same, since all the results were retrieved in the same session. Screenshots are only available for scraping executions.

In the above screenshot you'll notice a Connect button. This is available because the run that this was executed from contains Integrations. Clicking Connect will re-trigger those integrations if needed.

There is also an Archive button, which stores the current result as files and disables in-browser viewing of the result. Archiving happens automatically after 14 days.