Web Scraping

Cloud-based vs PC/Server-based Web Scraping

There is a lot of talk about cloud-based web scraping at the moment. While we predominantly sell the software, we also host scraping solutions and data for our clients. We thought it important to sift through all the hype and highlight the differences.

First up, it's important to understand that most "cloud-based" solutions on the market today aren't fully cloud-based. The majority install software on your PC and store only the retrieved data on a cloud server. Hence, they are not machine independent.

Let's look at some of the pros and cons of each approach.

Cloud-based Web Scraping

Pros

  • If it’s a true cloud-based service, it can run from any OS and any browser.
  • You don’t have to host anything yourself. Everything is done in the cloud.
  • All the website page views, data normalization, and transformation get handled on someone else’s server.
  • Web proxy requirements are managed for you.
  • True cloud solutions are machine independent – they can be accessed and run without install from any PC with Internet access around the world.

Cons

  • Ongoing monthly fees linked to the amount of data you extract.
  • As your web-scraping needs grow, it can get very expensive. The fees can quickly exceed the price of buying your own software.
  • Most cloud-based solutions on the market today require you to download a file to your machine or install a browser extension. So they are not platform independent or machine independent.
  • Complex websites (e.g. those using AJAX or heavy JavaScript) often can’t be handled by true cloud solutions.
  • Data security may be an issue.
  • If the provider shuts down (e.g. Kimono Labs), your business operations can be impacted and your data lost (i.e. you are given only a limited time to retrieve your data).
  • There will be restrictions on which websites you can download data from.
  • Collected data is usually only available to users as files (e.g. Excel, XML or CSV) or through an API, and cannot be fed directly into a database.
  • Some of the free “automatic” cloud-based offerings fail miserably on non-basic websites. They give you limited control over the data extraction and really serve only to get your “foot in the door” and introduce you to a paid service.
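
To illustrate the point about file exports, here is a minimal sketch of the extra step needed to get an exported CSV into your own database. It uses Python's standard csv and sqlite3 modules. The sample data, the table name products and the function load_export are all hypothetical and chosen only for illustration; a real export will have its own columns.

```python
import csv
import io
import sqlite3

# Hypothetical export: stands in for a CSV file downloaded from a cloud service.
EXPORT_CSV = """name,price,url
Widget,9.99,https://example.com/widget
Gadget,24.50,https://example.com/gadget
"""

def load_export(conn, csv_text):
    """Insert rows from an exported CSV into a local 'products' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products (name TEXT, price REAL, url TEXT)"
    )
    rows = [
        (r["name"], float(r["price"]), r["url"])
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    conn.executemany("INSERT INTO products VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
inserted = load_export(conn, EXPORT_CSV)
```

With a file-only export, a step like this has to be scheduled and maintained separately from the scraping itself.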

PC/Server-based Web Scraping

Pros

  • One-time up-front license fee.
  • No monthly fees linked to the amount of data you extract.
  • No restrictions on what websites you can scrape data from.
  • No restrictions on the amount of data you can download.
  • Data can be saved to your own database including existing database structures.
  • Data can be saved to your own websites in real time.
  • Ability to integrate with third-party applications via API and/or web services.
  • You can manage your own data security.

Cons

  • Need to host the web scraping agents.
  • May need to manage proxies to deal with website blocking.
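
As a rough idea of what managing proxies can involve, the sketch below rotates requests through a small pool in round-robin order using Python's standard urllib.request module. The proxy addresses and the helper next_opener are hypothetical; this is one simple approach, not a description of how any particular scraping product handles blocking.

```python
import itertools
import urllib.request

# Hypothetical proxy addresses; replace with proxies you control or rent.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

_proxy_cycle = itertools.cycle(PROXIES)

def next_opener():
    """Return (proxy, opener); the opener routes requests through the next proxy."""
    proxy = next(_proxy_cycle)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return proxy, urllib.request.build_opener(handler)

# Each call picks the next proxy in the pool; no request is sent here.
used = [next_opener()[0] for _ in range(4)]
```

To fetch a page you would call something like opener.open(url, timeout=10) on the returned opener. Real deployments usually also need to retire proxies that get blocked and to throttle request rates.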

Summary

In general, a cloud-based solution is a good choice for individuals or organizations wanting to try out web scraping and understand how it works. If your data extraction needs are limited and involve only simple websites, then a cloud-based service (particularly a free one, if it works) is ideal for you.

If you expect your data extraction needs to grow and you want greater control over your data and over which websites you can scrape, then a PC/Server software solution is the better choice, both technically and financially.
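
The financial side of this comparison can be sketched as a simple break-even calculation. The figures below are hypothetical, and the model assumes a flat monthly cloud fee and a flat monthly cost for hosting your own agents. In practice cloud fees tend to rise with data volume, which would bring the break-even point forward.

```python
import math

def break_even_month(license_fee, monthly_hosting, monthly_cloud_fee):
    """Return the first month in which cumulative cloud fees exceed the
    one-time license fee plus cumulative self-hosting costs, or None if
    the cloud option never becomes the more expensive one."""
    saving_per_month = monthly_cloud_fee - monthly_hosting
    if saving_per_month <= 0:
        return None
    return math.floor(license_fee / saving_per_month) + 1

# Hypothetical figures, for illustration only.
months = break_even_month(license_fee=1000, monthly_hosting=20, monthly_cloud_fee=120)
```

With these example numbers the costs are equal after month 10 (1200 on both sides), so the cloud service becomes the more expensive option from month 11 onward.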