Issues in Understanding, Indexing, Querying, and Visualizing Spatio-textual Spreadsheets

Numerous organizations, including government agencies, are sitting on mountains of spreadsheet data. Such data are becoming increasingly common on the web, but their contents remain out of reach of search engines because direct links to the contents of individual cells are rare. Spreadsheet data thus resemble legacy databases, especially since many of their underlying schemas are no longer accessible. The goal of this research is to discover the schema according to which a spreadsheet is constructed. The focus is on the spatio-textual spreadsheet, which is a spreadsheet where the values of the spatial attributes are specified textually. Such spreadsheets support spatial searches whose output is visual and whose utility is enhanced by being able to handle spatial synonyms. This is done, in part, by devising methods to automatically discover the spatial attributes of the spreadsheet and to distinguish between the several instances of them that arise due to the presence of a containment hierarchy. In particular, use is made of spatial coherence: spatial data in the same column are usually of the same spatial type, while spatial data in the same spreadsheet row usually exhibit a containment relationship. Moreover, adjacent or nearby rows exhibit spreadsheet coherence in that they are usually similar. The broad impact of this research is to make spreadsheet data a first-class citizen on the web, with the same chances of being discovered and accessed as data found in other documents.
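The two spatial coherence observations can be illustrated with a small sketch. It is not the project's implementation: the `GAZETTEER` dictionary, its entries, and the function names `column_type` and `containment_score` are all hypothetical, chosen only to show how a column-level type vote and a row-level containment check might be computed.

```python
from collections import Counter

# Hypothetical toy gazetteer (an assumption for illustration only):
# place name -> list of (spatial_type, container) interpretations.
GAZETTEER = {
    "Paris": [("city", "France"), ("city", "Texas")],
    "Lyon": [("city", "France")],
    "Austin": [("city", "Texas")],
    "France": [("country", None)],
    "Texas": [("state", "United States")],
}


def column_type(values):
    """Column coherence: return the spatial type that the most cells can take,
    together with the fraction of cells that can take it."""
    counts = Counter()
    for value in values:
        # Count each type once per cell, even if a name has several readings.
        counts.update({spatial_type for spatial_type, _ in GAZETTEER.get(value, [])})
    if not counts:
        return None, 0.0
    best_type, hits = counts.most_common(1)[0]
    return best_type, hits / len(values)


def containment_score(rows, child_col, parent_col):
    """Row coherence: fraction of rows in which some reading of the child cell
    is contained in the place named by the parent cell."""
    if not rows:
        return 0.0
    hits = 0
    for row in rows:
        readings = GAZETTEER.get(row[child_col], [])
        if any(container == row[parent_col] for _, container in readings):
            hits += 1
    return hits / len(rows)


rows = [("Paris", "France"), ("Lyon", "France"), ("Austin", "Texas")]
print(column_type([r[0] for r in rows]))   # type vote for the first column
print(containment_score(rows, 0, 1))       # first column contained in second
print(containment_score(rows, 1, 0))       # reversed direction
```

In this toy table the first column is uniformly of type city, and the containment check succeeds only in one direction, which is the kind of signal that could separate the levels of a containment hierarchy.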

NSF Grant IIS-10-18475

PI: Hanan Samet

    Relevant Publications

    1. M. D. Adelfio, H. Samet
      Schema extraction for tabular data on the web.
      PVLDB, 6(6):421-432, April 2013. [link]
      Also Proceedings of the 39th International Conference on Very Large Data Bases (VLDB)
      Categories: [spreadsheets]

    2. M. D. Adelfio, H. Samet
      Structured toponym resolution using combined hierarchical place categories.
      In R. Purves and C. Jones, editors, Proceedings of 7th ACM SIGSPATIAL Workshop on Geographic Information Retrieval (GIR'13), pages 49-56, Orlando, FL, November 2013. [link]
      2013 GIR'13 Best Paper Award
      Categories: [spatio-textual search engine]

    3. M. D. Adelfio, H. Samet
      GeoWhiz: Using common categories for toponym resolution.
      In C. A. Knoblock, P. Kröger, J. C. Krumm, M. Schneider, and P. Widmayer, editors, Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 542-545, Orlando, FL, November 2013. [link]
      Categories: [spatio-textual search engine]

    4. M. D. Adelfio, H. Samet
      Itinerary retrieval: Travelers, like traveling salesmen, prefer efficient routes.
      In R. Purves and C. Jones, editors, Proceedings of 8th ACM SIGSPATIAL Workshop on Geographic Information Retrieval (GIR'14), Dallas, TX, November 2014. [link]
      Categories: [spatio-textual search engine]

    5. M. D. Adelfio, H. Samet
      Automated itinerary visualization.
      In Y. Huang, M. Gertz, J. C. Krumm, J. Sankaranarayanan, and M. Schneider, editors, Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Dallas, TX, November 2014. [link]
      Categories: [spatio-textual search engine]

    Theses

    1. M. D. Adelfio
      Automated Structural and Spatial Comprehension of Data Tables.
      PhD Thesis. University of Maryland, Department of Computer Science, 2015. [link]

    2. M. D. Lieberman
      Multifaceted Geotagging for Streaming News.
      PhD Thesis. University of Maryland, Department of Computer Science, 2012. [link]

    Programs/Software

    1. GeoWhiz. A demonstration of our method for interpreting lists of place names (such as those found in a spreadsheet or table column). The method attempts to identify a common thread shared by all entries in the list, which can then be used to interpret ambiguous place names more accurately.

    2. Spatial Browser for Wikipedia Categories. An alternative way of browsing location sets. Rather than showing each location as a marker on a single map, we show a closeup of each location. This allows for the comparison of visual attributes (via satellite imagery) of each location.

    3. PhotoStand. An image-based browser that uses a map query interface to retrieve news photos. The photos are associated with news articles, which are in turn associated with the principal locations that they mention, based on data from the NewsStand system.

    4. NewsStand. An example application of a general framework that enables people to search for information with a map-query interface. The NewsStand system monitors the output of more than 10,000 RSS news feeds and incorporates new articles within minutes of publication. Each article undergoes a geotagging procedure, where location references are identified and interpreted, allowing us to associate each article with the geographic locations that it mentions.

    5. GeoXLS. A geographic search system that enables users to submit a set of locations as a query object Q and to find documents containing locations similar to those in Q. The demonstration system supports searching location sets derived from Wikipedia categories, news clusters (the NewsStand dataset), and a large collection of geographic spreadsheets and data tables.

    6. Ontuition. A system for filtering and mapping ontologies. The demonstration application uses an ontology of medical and drug trials, sourced from the NIH-administered website clinicaltrials.gov.
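The "common thread" idea described for GeoWhiz in item 1 can be sketched as a simple vote over categories. This is only an illustration under stated assumptions, not the method GeoWhiz uses: the `CANDIDATES` table, its category labels, and the function `resolve_list` are hypothetical, and a plain coverage count stands in for whatever scoring the actual system applies.

```python
from collections import Counter

# Hypothetical candidate readings (an assumption for illustration only):
# place name -> list of (interpretation, category).
CANDIDATES = {
    "Paris": [("Paris, France", "capital"), ("Paris, Texas, USA", "city in Texas")],
    "London": [("London, UK", "capital"), ("London, Ontario, Canada", "city in Ontario")],
    "Rome": [("Rome, Italy", "capital"), ("Rome, Georgia, USA", "city in Georgia")],
}


def resolve_list(names):
    """Pick the category available to the most names, then read each name
    under that category. Names with no reading in it are left unresolved."""
    coverage = Counter()
    for name in names:
        # Count each category once per name.
        coverage.update({category for _, category in CANDIDATES.get(name, [])})
    if not coverage:
        return None, {}
    common, _ = coverage.most_common(1)[0]
    resolved = {}
    for name in names:
        for interpretation, category in CANDIDATES.get(name, []):
            if category == common:
                resolved[name] = interpretation
                break
    return common, resolved


category, resolved = resolve_list(["Paris", "London", "Rome"])
print(category)
print(resolved)
```

Taken alone, each of these names is ambiguous; taken as a list, the shared category favors one reading of each.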