Finding Data Tables on the Web
I'm slightly (fashionably?) late to this party, but I just came across a new website called GraphWise that sets out to be the search engine for tabular data. In a recent press release, they state, “…if you want to search for videos, you go to YouTube, and if you want music, you go to iTunes. If you're looking for tables of data we aim for users to go to GraphWise.” The comparison may not be entirely accurate since YouTube and iTunes search only their own catalogs, but the vision has some potential if they can pull it off.
Currently, when I look for a data set on the web, I start with these standard tactics:
- Google Search, by keyword only
- Google Search, by keyword with file type qualifier (e.g., filetype:csv)
- Delicious Search, by keyword
- Delicious Search, by tag (e.g., publicdata)
- Data “Repository” Search, such as Swivel, Data360 or ManyEyes
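The file type qualifier in the second tactic is easy to script if you run these searches often. Here is a minimal sketch in Python. The function names (build_query, search_url) are my own, and the Google search URL with a q parameter is an assumption about the public search page, not an official API.

```python
from urllib.parse import urlencode


def build_query(keywords, filetype=None):
    """Return a search string, optionally restricted to one file type."""
    query = keywords.strip()
    if filetype:
        # Append the qualifier, e.g. "filetype:csv".
        query += f" filetype:{filetype}"
    return query


def search_url(keywords, filetype=None):
    """Build a search URL for the query (URL pattern is an assumption)."""
    params = urlencode({"q": build_query(keywords, filetype)})
    return "https://www.google.com/search?" + params


if __name__ == "__main__":
    print(search_url("US area codes", "csv"))
```

Pasting the printed URL into a browser runs the same search I would otherwise type by hand.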
GraphWise provides an additional option to find data. It apparently spiders data (from HTML tables, CSV files, licensed sources and user uploads), then imports and normalizes the data and, ultimately, develops graphs based on the data (similar to Swivel or Data360). I rarely need auto-generated visualizations, but I really like that they provide the URL of the original source table. With Kirix Strata™, it's obviously a piece of cake to just import the raw table and start using it.
I did have some trouble finding useful data sets based on my search queries (forgivable, as the service is still in beta). For instance, in my previous blog post, we needed to find area code data in tabular format. So, I searched for US Area Codes in GraphWise, but got nothing even close to what I was looking for. For a simpler example, I searched for Apple's stock price. It looks like GraphWise licenses historical stock information from a company called CSI, but it only displayed the data in bite-sized chunks. I know I can easily download the full set of Apple's historical stock data via CSV at Yahoo Finance, but that wasn't listed as a resource.
It appears GraphWise has done well with the spidering technology to identify and capture table information across the web. The next big step will be to make the search results more relevant. Because HTML tables and CSV files aren't often linked to directly, it would be really difficult to apply the kind of PageRank algorithm that makes Google so valuable. I can imagine some other issues as well, like trying to separate a table name (if available) from the actual text within a given table. Hopefully they'll be able to overcome these hurdles; it would be great to have a Google-like place to identify tabular data on the web.
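To make the table name problem concrete, here is a minimal sketch in Python that pulls tables out of an HTML page and keeps the caption (when the page supplies one) apart from the cell text. This is only an illustration using the standard library's html.parser; I have no idea how GraphWise actually does it. The class and function names are mine, the sample HTML is made up, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each <table> as {"caption": str or None, "rows": [[str, ...], ...]}."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._table = None   # table currently being read
        self._row = None     # row currently being read
        self._target = None  # "caption" or "cell" while collecting text
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = {"caption": None, "rows": []}
        elif self._table is None:
            return  # ignore markup outside of tables
        elif tag == "caption":
            self._target, self._buf = "caption", []
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._target, self._buf = "cell", []

    def handle_data(self, data):
        if self._target is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._table is None:
            return
        # Collapse runs of whitespace in the collected text.
        text = " ".join("".join(self._buf).split())
        if tag == "caption" and self._target == "caption":
            self._table["caption"] = text
            self._target = None
        elif tag in ("td", "th") and self._target == "cell":
            self._row.append(text)
            self._target = None
        elif tag == "tr" and self._row is not None:
            self._table["rows"].append(self._row)
            self._row = None
        elif tag == "table":
            self.tables.append(self._table)
            self._table = None


def extract_tables(html):
    """Return every table found in the given HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.tables


if __name__ == "__main__":
    # Made-up sample page, for illustration only.
    sample = (
        "<table><caption>US Area Codes</caption>"
        "<tr><th>Code</th><th>State</th></tr>"
        "<tr><td>212</td><td>New York</td></tr></table>"
    )
    for table in extract_tables(sample):
        print(table["caption"], table["rows"])
```

The catch is that many pages skip the caption element entirely and put the table's name in a nearby heading or paragraph, which is exactly why this is a hard problem for a crawler.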
(via Swivel)