For every data source that has a well-documented XML layout, there are a dozen that lock their data into oddly formatted web pages or PDF documents. These five scraping tools can automate the process of making that data usable, expanding the horizons of your business while saving you and your staff valuable time.
Founded in London in 2012, import.io aims to make data scraping from website sources as simple as possible. Its cross-platform desktop application features a point-and-click interface, making it easy for even non-programmers to generate a scraper. The software is free, with curated ‘big data’ sets available to paying customers. Earlier this month, the UK-based platform received a US$3 million investment boost from a group including Jerry Yang, the co-founder of Yahoo!, who said, “import.io has the potential to revolutionise how we look at data on the web and will enable the next generation of data-driven computing.”
Used by businesses including Channel 4, The Guardian and the UK Government, ScraperWiki offers a collaborative platform for writing your own data scraper directly in your browser. The team has also recently launched Table Xtract, which converts PDF tables into spreadsheet format. The platform is easy to use, and a free account can run up to three simultaneous scrapers.
If you want to share your data scraping efforts, then Kimono is for you. Using a web browser extension or bookmarklet, Kimono allows for rapid creation of data scraping application programming interfaces (APIs). These APIs are then shared between users, giving beginners a head start. The ability to keep the APIs you create private will arrive with the launch of Pro and Enterprise subscriptions, promised in the near future.
Launched from the ashes of TweetMeme, the popular but now defunct Twitter news feed service, Reading-based DataSift has a tighter focus than most data scrapers. It looks specifically at social data, giving users the ability to extract and filter content from thousands of sites including Twitter, Facebook, Tumblr and WordPress. Its data stream can be used commercially or non-commercially, and a free trial period is available.
Keen to take your data capturing to the next level? Scrapy is an open source data scraping framework that’s ideal for programmers who want more control over their tools. Used primarily for data mining, monitoring and automated testing, it can also be used by non-programmers via the Portia visual scraping tool add-on from Scrapinghub.com, although some technical knowledge is still required for the initial setup.
Any company requires a constant stream of data to stay on top of its industry. Without data scraping software, that task can be insurmountable. These tools, and other similar services, can save real time and energy when used properly, and they belong in the standard toolkit of any modern business. Are they in yours?