ChicagocrimeAndScrapePI

From MashupCamp

This presentation had two parts:

* Adrian Holovaty presented chicagocrime.org.
* A walk through screen-scraping, after which Thor Muller led a discussion on the emerging phenomenon of the "ScrapePI."

Here are notes for the talk that Thor Muller gave about ScrapePI:

What is the problem?

  • Basic services aren’t available as APIs: Wikipedia, TV Guide, IMDb, Craigslist (!), and virtually all government/public-domain sources
  • Page scraping is annoying/hard to maintain
  • Hacks (like this: http://www.hackdiary.com/archives/000070.html) are pretty iffy too
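To make the maintenance problem concrete, here is a minimal sketch of the kind of scraper a ScrapePI would wrap: it turns page markup into structured records. The markup, the `item` class name, and the `scrape_listings` function are all hypothetical and chosen only for illustration; a real page would be fetched over HTTP rather than held in a string.

```python
from html.parser import HTMLParser
import json

# Hypothetical markup standing in for a fetched listings page.
SAMPLE_HTML = """
<ul id="listings">
  <li class="item"><a href="/post/1">Bike for sale</a></li>
  <li class="item"><a href="/post/2">Couch, free</a></li>
</ul>
"""


class ListingParser(HTMLParser):
    """Collects url/title pairs from <li class="item"><a> elements."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_item = False
        self._href = None   # set while inside an <a> within an item
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "item":
            self._in_item = True
        elif tag == "a" and self._in_item:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a tracked link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.items.append({"url": self._href, "title": title})
            self._href = None
        elif tag == "li":
            self._in_item = False


def scrape_listings(html):
    """Return the page's listings as a list of dicts (the 'API' response)."""
    parser = ListingParser()
    parser.feed(html)
    return parser.items


listings = scrape_listings(SAMPLE_HTML)
print(json.dumps(listings, indent=2))
```

Note how much of the code depends on incidental markup: if the site renames the `item` class or restructures the list, the parser silently returns nothing.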

Examples of ScrapePIs

  • Ontok Wikipedia API
  • XMLTV (uses scrapers from many contributors)
  • Comparison shopping and eBay examples

Practical issues

  • Changes to source pages/breaking the scraper
  • Images and media may get blocked
  • Coarse lookups only (e.g. no “what has changed” lookups)
  • Bandwidth consumption – both from source and for ScrapePI
  • Adding an intermediary between the source and its users
  • Authentication/cookies (see Yodlee)
  • Limitations on content use; stay complementary with source
  • Sensitive data types
  • Regulated data
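Two of the issues above, bandwidth consumption and scrapers breaking when source pages change, can be partly mitigated in code. The following is a rough sketch under stated assumptions, not a reference design: `CachedScraper`, `ScraperBroken`, and the injected `fetch`/`parse` callables are hypothetical names. It caches parsed results for a fixed time-to-live so repeated lookups do not hit the source, and it treats an empty parse as a sign that the page layout changed.

```python
import time


class ScraperBroken(Exception):
    """Raised when a parse yields nothing, suggesting the source page changed."""


class CachedScraper:
    def __init__(self, fetch, parse, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch        # callable: url -> page text
        self._parse = parse        # callable: page text -> list of records
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}           # url -> (expires_at, records)

    def get(self, url):
        now = self._clock()
        hit = self._cache.get(url)
        if hit is not None and hit[0] > now:
            return hit[1]          # still fresh: no request to the source
        records = self._parse(self._fetch(url))
        if not records:
            # Fail loudly rather than cache or serve an empty result.
            raise ScraperBroken(f"no records parsed from {url}")
        self._cache[url] = (now + self._ttl, records)
        return records


# Stand-ins for a real HTTP fetch and HTML parser, so the sketch runs offline.
calls = []


def fake_fetch(url):
    calls.append(url)
    return "a,b,c"


def fake_parse(page):
    return [field for field in page.split(",") if field]


scraper = CachedScraper(fake_fetch, fake_parse, ttl_seconds=60)
first = scraper.get("http://example.com/listings")
second = scraper.get("http://example.com/listings")
print(first, "fetches:", len(calls))
```

A fixed TTL keeps the sketch short; it does nothing for the "what has changed" lookups the notes mention, which would need support from the source itself.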

Legal

  • No explicit rights to use commercial data
  • What is fair use?
  • What constitutes a unique collection of data (database)?
  • Data Protection Act (UK)
  • Check for robots.txt
  • When are there damages?
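The robots.txt check is the one item on this list that is easy to automate. Below is a small sketch using Python's standard `urllib.robotparser`; the robots.txt content, the `ScrapePI-bot` user agent, and the `allowed` helper are made up for illustration. It only tests the crawling convention and says nothing about the other legal questions listed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a live check would download the site's own.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(SAMPLE_ROBOTS, "ScrapePI-bot", "http://example.com/listings"))
print(allowed(SAMPLE_ROBOTS, "ScrapePI-bot", "http://example.com/private/data"))
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` fetches and parses the file instead of the inline string used here.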

Strategic

  • Without ScrapePIs, people can still build one-off scrapers, but there’s no public pressure on the service and no leading by example
  • ScrapePIs force services and data providers to open APIs
  • Niche players see value of opening data
  • New revenue opportunity?