Thursday, May 26, 2005

Querying and Extracting Web Data

It has always been a source of frustration to me that it is non-trivial to pose queries over structured web data. To take a simple example, consider the sequence of tasks I had to undertake to come up with the movie statistics in my previous post:
  1. Go to Roger Ebert's Great Movies page to find his movie list in HTML.
  2. Fire up Excel. Use its import/export facility to convert the HTML table to plain-text.
  3. Use emacs reg-exp to create uniform field separators.
  4. Fire up SQLite command line, create a table schema and import the text file.
  5. Discover SQLite import bug which screws up when the last field is of numeric type.
  6. Apply some Unix "cut" and "paste" commands to shuffle the columns of the text file.
  7. Go back to SQLite, modify table schema and import data.
  8. Type out SQL Query: SELECT year/10 as decade, count(*) FROM Ebert GROUP BY decade.
  9. Enjoy results.
In hindsight, I should probably have just written a Perl script to parse the HTML and load the data directly into SQLite via the Perl DBI driver. Or even faster, just stare at the web page and manually compute my statistics. The above task is actually relatively simple because I was able to import all the data I needed and query it afterwards. But I often want to do much more complex queries that can't afford to crawl all external data ahead of time.

Here is an example query: Find all movies in my local database that I have given a high rating to (yes, I have a database of movies I've seen), but which have a much lower rating at IMDB. This query requires a join between data available locally (movie title and my rating) and data off the web (IMDB rating). It is fairly straightforward to write a script that executes this query, but what would be really nice is a good tool that lets us simply pose such queries and can automatically (ok, with a teeny bit of human input) go off and figure out how to execute them -- a simple, usable query engine over a data integration system that works with web sources.

On the bright side, Alon Halevy's Semex project at Washington appears to promise all that I want. (Worryingly, Alon has a tendency to avoid my Stanford talks like the plague. Maybe the mere fact of my mentioning this problem is enough to turn him off this line of research? :-))


Post a Comment

<< Home