CMSC 725 Fall 2015 project descriptions. If more information is needed, please ask the Teaching Assistant, Shangfu Peng. Some projects are not enough by themselves and thus would logically be combined with related projects. All projects are individual projects. There are NO group projects, with the possible exception of the Google Watch NewsStand project. The goal of each project is to be able to get a research paper out of it. This is not a requirement, but it has happened a number of times in the past and is something to consider when choosing a project.

NEWSSTAND-RELATED PROJECTS:

We have developed a number of systems for spatio-textual search.
1. STEWARD: http://steward.umiacs.umd.edu/
2. NewsStand: http://newsstand.umiacs.umd.edu/news/light
3. TwitterStand: http://twitterstand.umiacs.umd.edu/news

The key idea behind all of these systems is the identification of words that correspond to geographic locations (e.g., does "Jordan" refer to the name of a person like "Michael Jordan" or does it refer to the country or river?). If it refers to a geographic location, then we need to identify the appropriate instance (e.g., is "London" in England, Canada, etc.?). This work involves some linguistics and is generally known as "geotagging". Once the words have been correctly identified, we plot our results on a map. In other words, we associate documents with the locations that are contained within them, so that documents can be retrieved by location. More importantly, the query location is usually specified graphically by pointing at it using a map query interface. The advantage of using such an interface is that we need not know the exact name of the location. We just need to point at a nearby location. In other words, we are able to make use of synonyms in our queries, where the synonyms are restricted to being "spatial" (i.e., names of geographic locations). This is an example of "approximate search".

STEWARD works with a collection of documents and attempts to identify all instances of geographic locations in each document. It enables viewing a document with a map as well as finding all documents that contain a particular keyword. In addition, an attempt is made to determine a geographic focus for a document, which is the most important location in the document from the point of view of being the subject of the document. Currently, the system works for a collection of documents from the following sources:
A. HUDUSER: A collection of documents.
B. PubMed: Abstracts of published biomedical articles.
C. ProMED-mail: Mailing list postings about infectious disease outbreaks.
D. News: Collection of news (old)

NewsStand works with a set of 10,000+ HTML RSS news feeds. It cleans up the HTML so that the articles, images, and videos can be extracted. It then uses the TF-IDF representation of the articles to cluster them using an online clustering algorithm. A number of factors go into the clustering. In addition, the topics of the articles are determined (one of a small set). Images and videos are extracted and collected for each cluster, and the articles are geotagged. The main version of NewsStand is web-based. However, a separate implementation of NewsStand also exists on smartphones such as the iPhone and Android.
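The clustering step mentioned above can be pictured with a minimal sketch of online (leader-follower) clustering over TF-IDF vectors with cosine similarity. This is only an illustration, not NewsStand's implementation: the similarity threshold, the whitespace tokenizer, and the centroid update are all assumptions.

    # Minimal sketch of online (leader-follower) clustering of articles by
    # cosine similarity over TF-IDF vectors.  The threshold and the whitespace
    # tokenizer are illustrative assumptions, not NewsStand's actual choices.
    import math
    from collections import Counter, defaultdict

    class OnlineClusterer:
        def __init__(self, threshold=0.5):
            self.threshold = threshold
            self.doc_freq = defaultdict(int)   # document frequency per term
            self.num_docs = 0
            self.clusters = []                 # each cluster keeps a centroid (term -> weight)

        def _tfidf(self, text):
            tf = Counter(text.lower().split())
            self.num_docs += 1
            for term in tf:
                self.doc_freq[term] += 1
            return {t: c * math.log(self.num_docs / self.doc_freq[t] + 1.0)
                    for t, c in tf.items()}

        @staticmethod
        def _cosine(a, b):
            dot = sum(w * b.get(t, 0.0) for t, w in a.items())
            na = math.sqrt(sum(w * w for w in a.values()))
            nb = math.sqrt(sum(w * w for w in b.values()))
            return dot / (na * nb) if na and nb else 0.0

        def add(self, text):
            vec = self._tfidf(text)
            best, best_sim = None, 0.0
            for cluster in self.clusters:
                sim = self._cosine(vec, cluster["centroid"])
                if sim > best_sim:
                    best, best_sim = cluster, sim
            if best is not None and best_sim >= self.threshold:
                best["members"].append(text)      # join the most similar cluster
                for t, w in vec.items():          # fold the article into the centroid
                    best["centroid"][t] = best["centroid"].get(t, 0.0) + w
            else:
                self.clusters.append({"centroid": dict(vec), "members": [text]})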
TwitterStand works with tweets. The key idea is to capture tweets that correspond to late-breaking news. The result is analogous to a distributed news wire service. The difference is that the identities of the contributors/reporters are not known in advance and there may be many of them. The tweets are not sent according to a schedule. The tweets occur as news is happening and are noisy, while usually arriving at a high throughput rate. Some of the issues include removing the noise, determining tweet clusters of interest bearing in mind that the methods must be online (i.e., clusters whose elements are constantly changing over time, thereby reflecting the news cycle, which is one of increasing volume as the topic becomes known, followed by a more steady decline as interest naturally fades), and determining the relevant location associated with the contents of the tweets, rather than the locations from where the tweets are sent. This is quite challenging and makes heavy use of the techniques developed for the NewsStand system.

% HJS: Shangfu - please be precise below as to what we have access to.
% This is in terms of the Gardenhose, Firehose, and what we get
% from Twitter.
We have access to Gardenhose (define), Firehose (define).

Possible Projects:

1. Introduce news summaries into NewsStand. This is similar to a system by a 17-year-old that was recently sold to Yahoo. In some sense, the pyramid style of news writing is such that one can cut off the article at any point and have a summary. A question is the quality of the summaries provided by the commercial system over just using the result of the pyramid style.

2. Incorporate a broader search capability into NewsStand. Currently, the search command only deals with words that appear in the headline. In other words, only they are indexed. The problem is that indexing all terms will require much work and storage. This is what Google does. Devise some form of lazy evaluation for indexing so that, perhaps, indexes are only built for articles that have been fetched by queries. This could be broadened to index all articles that appear in a cluster one of whose element articles has been fetched by a query (see the inverted index sketch below).

3. Adapt STEWARD to deal with the Wikipedia. In this case, the result is a tool for looking at Wikipedia articles and providing a way to access the Wikipedia by location. Note that you will be identifying the locations using techniques similar to what is currently being used in STEWARD and NewsStand. Note that you are not using the "geotags" often provided in conjunction with the Wikipedia articles. However, you could do an evaluation of your own geotagging against that of the preassigned geotags. This is not simple in the sense that you need to be sure that you are doing something that does not already exist. Thus you will need to spend some time investigating what is currently available.

4. Adapt STEWARD to deal with documents of the World Bank. For this you need to have a contact at this organization. There may be a chance of doing this due to contacts with Ihab Ilyas. This is a long shot project.

5. Adapt STEWARD to deal with documents of a government agency. For this you need to have a contact at the agency.

6. Adapt STEWARD to deal with a set of documents on disease spread monitoring. This is already in existence but could be beefed up. Rongjian Lan did such a project with ProMED-mail in 2011 and ended up with a paper at the HEALTHGIS'12 workshop. See http://www.cs.umd.edu/~hjs/pubs/healthgis12.pdf In particular, something like this could be done by reaching out to Rita Colwell. Again, this is a long shot project which will require some initiative.
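For project 2 above, a lazy inverted index might look like the following sketch. The trigger (indexing an article's body only the first time its cluster is hit by a headline query) and the data layout are assumptions chosen for illustration.

    # Sketch of lazy full-text indexing (project 2): headlines are indexed
    # eagerly, article bodies only when their cluster is first touched by a
    # query.  The layout and trigger policy are illustrative assumptions.
    from collections import defaultdict

    class LazyIndex:
        def __init__(self):
            self.headline_index = defaultdict(set)   # term -> article ids
            self.body_index = defaultdict(set)       # built on demand
            self.articles = {}                        # id -> (headline, body, cluster_id)
            self.indexed_clusters = set()

        def add(self, art_id, headline, body, cluster_id):
            self.articles[art_id] = (headline, body, cluster_id)
            for term in headline.lower().split():
                self.headline_index[term].add(art_id)

        def _index_cluster(self, cluster_id):
            if cluster_id in self.indexed_clusters:
                return
            self.indexed_clusters.add(cluster_id)
            for art_id, (_, body, cid) in self.articles.items():
                if cid == cluster_id:
                    for term in body.lower().split():
                        self.body_index[term].add(art_id)

        def search(self, term):
            term = term.lower()
            hits = set(self.headline_index.get(term, set()))
            # lazily index the full text of every cluster that a headline hit touches
            for art_id in list(hits):
                self._index_cluster(self.articles[art_id][2])
            return hits | self.body_index.get(term, set())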
7. Adapt STEWARD or NewsStand to monitor mentions of brands in the news. In the case of NewsStand this could be done by adding a "Brand" layer, much in the same spirit as currently done for the disease and people layers. One possible source for names of brands is the website brandtags.com, which lists not only brands but the most common words associated with them. After poking around a little online, one of the most comprehensive lists is found at http://www.kelkoo.co.uk/bd-brands. It's hard to find an authoritative source. Since brands are somewhat fluid, being created and destroyed all the time, it would be tough to maintain a current, up-to-date list. This project came at the suggestion of Prof. William Rand of the Smith School of Business. This was done in the Fall of 2013 by Ahmed, who got a demo paper about it accepted at the 2015 SIGSPATIAL Conference.

8. Develop methods to determine the "geographic focus" of a document, individual news articles, and also clusters of news articles, and compare them. In particular, you will need to conduct an evaluation. The efficacy of your method may differ depending on the underlying input.

9. NewsStand currently uses a single server. The architecture of NewsStand is described in the following paper: http://www.cs.umd.edu/~hjs/pubs/newsstand-arch.pdf A broader description is given in: B. E. Teitler, M. D. Lieberman, D. Panozzo, J. Sankaranarayanan, H. Samet, and J. Sperling. NewsStand: A new view on news. In GIS'08: Proceedings of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 144-153, Irvine, CA, November 2008. http://www.cs.umd.edu/~hjs/pubs/newsstand-acmgis2008.pdf and in the October 2014 issue of Communications of the ACM. Investigate quality of service issues for NewsStand. In other words, investigate how many users the current implementation of NewsStand can support and the query mix that would yield a meaningful performance study. Determine if the quality of the service can be improved by making use of a cloud-like architecture such as is available on Amazon, and develop a prototype.

10. Investigate what aspects of the NewsStand pipeline could be sped up via use of GPUs. For example, look at the document ingestion system (i.e., cleanup module) as described in a paper that we wrote, found at /nfshomes/codepoet/work/papers/cleaner/main.pdf Alternatively, can we use several servers on the cloud to perform this task? The idea is to speed up the document ingestion component, possibly by distributing it so that a number of servers are cleaning up the documents in parallel, or possibly by using a GPU to do the task (see the worker-pool sketch below).

11. Investigate whether latency (distance from UMD) has an effect on the performance of NewsStand (both app and web versions). Investigate the use of a mirror site. Also investigate how quickly data received by the NewsStand clients can be rendered in NewsStand's interface. The speed of rendering greatly affects NewsStand's perceived interactivity. Determine if the rendering time can be shortened somehow.

12. Investigate the use of GPUs for clustering the news articles. Some work has been done on this already by Ben Teitler. The goal is to obtain speedups over a sequential approach. The speedup should be more than a factor of 2, which is what was obtained by Ben Teitler. DONE!
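The parallelization idea in project 10 above (several servers or workers cleaning up fetched pages at once) can be pictured with a worker pool. The clean_html function and the pool size below are stand-ins, not the actual NewsStand cleanup module.

    # Sketch of distributing the document ingestion/cleanup step (project 10)
    # across a pool of workers.  clean_html is a placeholder for the real
    # cleanup module; the pool size is an arbitrary choice.
    from multiprocessing import Pool
    import re

    def clean_html(page):
        # stand-in cleanup: strip tags and collapse whitespace
        text = re.sub(r"<[^>]+>", " ", page)
        return re.sub(r"\s+", " ", text).strip()

    def clean_feed(pages, workers=8):
        with Pool(workers) as pool:
            return pool.map(clean_html, pages)

    if __name__ == "__main__":
        print(clean_feed(["<html><body><p>Breaking news ...</p></body></html>"]))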
13. Investigate a distributed clustering algorithm for NewsStand. The idea is to distribute the computation of the similarity function (cosine) with every feature vector and figure out which ones can be pruned. This is an alternative solution to the one using GPUs in 12, although it may also make use of GPUs. Note that GPUs have a SIMD (single instruction, multiple data) architecture, with many cores all executing the same program, but with different data input. All the processors are able to access a single large shared memory space, which is relatively slow, as well as smaller cache memories on individual processors, which are faster. GPUs are optimized for arithmetic-heavy tasks which are frequently encountered in graphics processing, which makes them suitable for heavy number crunching. On the other hand, distributed architectures (e.g., MapReduce) involve many separate computers, each with dedicated and possibly non-uniform hardware, which communicate over a network. Both the GPU and the distributed architectures are readily programmable using general purpose programming languages.

14. Suppose that each user of NewsStand has access to a device with computing power as is planned for the NVIDIA Shield tablet device, which has a local GPU and which we own and you can borrow. Can you offload some of the functionality of NewsStand onto these local clients? For example, different RSS news feeds can be ingested by these local clients. Devise a scheme where, in essence, NewsStand is being outsourced.

15. NewsStand's geotagging module makes a number of errors. In particular, we have observed repeated errors for certain locations. Your job is to run NewsStand extensively and identify entities that are wrongfully identified as locations (e.g., "George" in South Africa and other first names that are often wrongfully interpreted as geographic locations) or are associated with the wrong geographic location (e.g., "Georgetown" being placed in Guyana instead of being associated with "Georgetown University" in Washington, D.C.). Some example locations are provided below. This is not exhaustive but gives you some idea of the scope of the problem: Georgetown, Northern District, Batman, Man, Barcelona, Santa Cruz, Valencia, Alexandria, Athens, Columbus, Columbia, Paris, London, Berlin, Moscow, Salem, Springfield, Dover, Newark, Augusta, Portland, St. Louis, Dublin, Hollywood, George, San Jose, Norwalk, Edgewood, Portland, La Paz, Manchester, Washington, San Miguel, Santiago, Cairo, Damascus, Arlington, Richmond, Palestine, York, Winchester, Chester, Bristol, Rome, Vienna, Brentwood, Oakland, Riverside, Algiers, Geneva, Monaco, Frankfurt, Lexington, Concord, Dover. One possible approach is to have a specialist for the various locations. For example, in the case of "Georgetown" the idea is to be able to recognize all possible usages of it when it is the name of the university in Washington, DC. In this case, the references are often to the athletic teams. It would be interesting to see the extent to which the general geotagging module in NewsStand can be improved, if at all, using these specialized packages (see the sketch following item 16 below). Several students can be assigned this project, with a competition at the end to see who does best. Shangfu Peng has looked at this problem but found no definitive solution. Zachary Williams is also currently looking at it.

16. Investigate the utility of time-dependent geotagging, where the resolution of a possible toponym depends on its current usage. For example, "George" may appear a lot in the news at a certain time and thus an interpretation as a location may be less probable.
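One way to picture the "specialist" idea in project 15 is a small rule-based disambiguator that looks for context words before accepting a default gazetteer interpretation. The cue words, coordinates, and interpretations listed here are illustrative assumptions, not output of NewsStand's geotagger.

    # Sketch of a per-toponym "specialist" (project 15): context cues decide
    # whether "Georgetown" means the university in Washington, DC rather than
    # the default gazetteer interpretation (Georgetown, Guyana).  The cue
    # words and coordinates are illustrative assumptions.
    SPECIALISTS = {
        "georgetown": [
            # (cue words that must appear nearby, chosen interpretation)
            ({"hoyas", "university", "basketball", "campus"},
             ("Georgetown University, Washington, DC", 38.9076, -77.0723)),
            ({"guyana", "demerara"},
             ("Georgetown, Guyana", 6.8013, -58.1551)),
        ],
    }

    def resolve(toponym, context_words, default):
        """Return a (name, lat, lon) interpretation for toponym given its context."""
        for cues, interpretation in SPECIALISTS.get(toponym.lower(), []):
            if cues & context_words:
                return interpretation
        return default

    context = {"the", "hoyas", "won", "their", "opener"}
    print(resolve("Georgetown", context,
                  ("Georgetown, Guyana", 6.8013, -58.1551)))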
17. NewsStand has a topic detection module, which identifies the general topic of an article (e.g., business, sports). Evaluate its quality and compare with other work in the literature. Try to develop a better method. Samet Ayhan tried this topic in 2011 but the results were inconclusive.

18. NewsStand currently only works for articles in English. This means that only news sources in English are handled. The scope of NewsStand could be significantly increased by including articles in languages other than English. The problem here is that these articles would not be properly clustered, as the TF-IDF mechanism is language sensitive. Also, geographic names differ in the various languages. One suggestion is to use English as the canonical language and use Google Translate to convert the foreign language articles to English prior to clustering them. This would be done only internally, while the output of NewsStand would still refer to the original articles.

4. Approximate distance oracles. A distance oracle represents the network distances between pairs of vertices in a road network as a set of (key, value) pairs (i.e., a hash table), such that given any pair of source and destination vertices and an estimation error bound \epsilon, the use of an approximate distance oracle of size O(n/\epsilon^2) enables one to obtain the \epsilon-approximate network distance by making O(log n) accesses to the hash table. The term n is the number of vertices in the road network. There are a couple of research problems to solve here:
a) Distance oracles are expensive to compute since you need the distances between every pair of vertices in the road network. However, there are many obvious ways of cutting corners here to achieve real speedups. Can you figure out a way of scaling these oracles to work on a data set spanning the whole of the USA?
b) Can you then map the distance oracles to a key-value store to enable the use of cloud computing infrastructures such as MapReduce for performing operations on large road network datasets? Also, can you reduce the number of accesses to one or two rather than O(log n)? This is an easy part, since the mechanics of it are quite easy.
The outcome of this project could be a really useful service on the web where thousands of people can pose distance queries and get approximate network distances in fractions of a second. Shangfu Peng has pretty much done this. Need to check with him whether any of these suggestions still remain viable at this point.

5. Incremental nearest neighbor algorithm:
A. For the Frechet distance. See the SIGSPATIAL'12 paper titled "A GPU approach to sub-trajectory clustering using the Frechet distance" by Joachim Gudmundsson and Nacho Valladares.
B. Incremental nearest neighbor algorithm for the Earth mover's distance. See papers by David Jacobs and Leo Guibas.

6. Look into Charikar and Broder's work on LSH (Kanellakis award) and figure out how it is related to the example in Samet's "Foundations of Multidimensional and Metric Data Structures", as well as try to incorporate it into the text. Charikar's paper on simhash: M. Charikar. Similarity estimation techniques from rounding algorithms. Proceedings of STOC, pp. 380-388, 2002. http://www.bradblock.com.s3-website-us-west-1.amazonaws.com/Similarity_Estimation_Te+chniques_from_Rounding_Algorithms.pdf It could be that the example in Section 4.7.4 on pages 711-716 of Samet's Foundations of Multidimensional and Metric Data Structures that uses binary digits is in fact simhash. For a good treatment of simhash, see http://matpalm.com/resemblance/simhash/ Broder's paper on minhash: A. Z. Broder. On the resemblance and containment of documents. Proceedings of the IEEE Conference on Compression and Complexity of Sequences, pp. 21-29, 1997. http://gatekeeper.dec.com/ftp/pub/dec/SRC/publications/broder/positano-final-wpnums.pdf
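As background for project 6, a minimal simhash in the spirit of Charikar's random-hyperplane rounding might look as follows. The 64-bit fingerprint size, the md5-of-term shortcut (standing in for explicit random hyperplanes), and the example strings are simplifications, not the paper's construction.

    # Minimal simhash sketch (cf. Charikar's rounding scheme, project 6).
    # A 64-bit fingerprint is built from per-term hashes; two documents are
    # compared by the Hamming distance of their fingerprints.
    import hashlib
    from collections import Counter

    BITS = 64

    def simhash(text):
        counts = [0] * BITS
        for term, weight in Counter(text.lower().split()).items():
            h = int(hashlib.md5(term.encode()).hexdigest(), 16)
            for i in range(BITS):
                counts[i] += weight if (h >> i) & 1 else -weight
        return sum(1 << i for i in range(BITS) if counts[i] > 0)

    def hamming(a, b):
        return bin(a ^ b).count("1")

    f1 = simhash("distance oracles for spatial networks")
    f2 = simhash("approximate distance oracles for road networks")
    print(hamming(f1, f2), "differing bits out of", BITS)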
7. Provide simple pseudocode for the Yi and Samet ICDE'13 paper in the style of Samet's algorithms and see if it can be extended to metric spaces.

8. Look into the structured graphs work of Leila De Floriani and see if it can be applied to social networks. Need to look up her papers.

9. Spatially-oriented social network identification. In other words, investigate how to incorporate knowledge of spatial proximity (also temporal proximity).

10. Is it possible to use differential privacy in a spatial context? Shangfu Peng is an expert here.

11. Parallel graph traversal in a distributed computing environment. The problem of graph traversal is concerned with searching through graph nodes to find solutions with respect to a certain optimization criterion. For example, the problem of finding the k nearest supermarkets with respect to a location q in a road network can be modeled as a problem of traversing the road network to find a supermarket p (in a set of supermarkets) which minimizes the network distance D(q,p). This project is concerned with applying distributed computing frameworks, e.g., MapReduce and Distributed GraphLab, to formulate different graph traversal algorithms for different spatial-network query problems, e.g., range, nearest neighbor, reverse nearest neighbor, skyline, and distance join queries.

12. Survey of spatio-textual join papers and work.
a. ICDE 2013
b. IEEE TKDE. "A Prefix-Filter based Method for Spatio-Textual Similarity Join" by S. Liu, G. Li, and J. Feng (Tsinghua University, Beijing). To appear. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6517849
Try to embed such a query into NewsStand.

13. Connected component labeling on GPUs and maybe other operations. E.g., http://scholar.google.com/scholar_url?hl=en&q=http://link.springer.com/article/10.1007/s11265-013-0780-0&sa=X&scisig=AAGBfm2l2qu-nTRZNaFxYCtXYksN8KX-dg&oi=scholaralrt (A union-find sketch of the classical two-pass labeling scheme appears after item 16 below.)

14. Can you adapt the following methods to spatial data like quadtrees or some other appropriate variant: Optimizing bitmap indices with efficient compression. K. Wu, E. J. Otoo, and A. Shoshani. ACM Transactions on Database Systems (TODS), 31(1):1-38. Look at other work by K. (John) Wu, especially: On the performance of bitmap indices for high cardinality attributes. K. Wu, E. Otoo, and A. Shoshani. Proceedings of the Thirtieth International Conference on Very Large Data Bases, pages 24-35, 2004. Also see: P. Nagarkar, K. S. Candan, and A. Bhat. Compressed spatial hierarchical bitmap (cSHB) indexes for efficiently processing spatial range query workloads. PVLDB, 8(12):1382-1393, August 2015.

15. Reimplement the SAND Internet Browser using the Vertica community edition. This means substituting the SAND homegrown database with the Vertica database. Is this possible? Need to ask Jagan, and also whether it is worth it in the sense of what we would gain from doing so. This would mean that the various algorithms for computing the incremental nearest neighbor and range queries would need reimplementing. This also means investigating how to use a pointerless quadtree in this environment, or whether one would use a pointer-based quadtree. One option is to store quadtrees in an array representation where pointers are array indexes. The problem is that there is little ability for random access, as one needs to start at the root. Also, it does not support updates.

16. Ask Vertica for some of the queries that their spatial customers would like to execute and see if we can develop algorithms. We were worried about representing polygons with holes. How would this be done with PMR and PM2 quadtrees? We have tried this a number of times but received no response.
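Item 13 above and item 18 below both concern connected component labeling. The classical two-pass scheme with union-find that those papers build on can be sketched as follows, for 4-connectivity on a binary image; this is the textbook algorithm, not any particular paper's optimized variant.

    # Sketch of two-pass connected component labeling with union-find on a
    # binary image (4-connectivity).  This is the textbook scheme that the
    # Wu/Otoo/Suzuki and Fiorio/Gustedt papers optimize, not their code.
    def label(image):
        rows, cols = len(image), len(image[0])
        labels = [[0] * cols for _ in range(rows)]
        parent = [0]                      # parent[i] is the union-find parent of label i
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        def union(a, b):
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)
        next_label = 1
        for r in range(rows):             # first pass: provisional labels + equivalences
            for c in range(cols):
                if not image[r][c]:
                    continue
                up = labels[r - 1][c] if r else 0
                left = labels[r][c - 1] if c else 0
                if up == 0 and left == 0:
                    parent.append(next_label)
                    labels[r][c] = next_label
                    next_label += 1
                else:
                    labels[r][c] = min(x for x in (up, left) if x)
                    if up and left:
                        union(up, left)
        for r in range(rows):             # second pass: replace labels by their roots
            for c in range(cols):
                if labels[r][c]:
                    labels[r][c] = find(labels[r][c])
        return labels

    print(label([[1, 1, 0, 1],
                 [0, 1, 0, 1],
                 [1, 0, 0, 1]]))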
17. Does it make sense to use MongoDB for spatial data such as in the SAND Internet Browser?

18. K. Wu's results on connected component labeling: Kesheng Wu, Ekow Otoo, and Kenji Suzuki. Optimizing two-pass connected-component labeling algorithms. Pattern Analysis and Applications, 12(2):117-135, June 2009. They claim to have the fastest method of doing connected component labeling. They compared it with the following algorithm, characterized as being of the same type as the Dillencourt/Samet/Tamminen algorithm: C. Fiorio and J. Gustedt. Two linear time union-find strategies for image processing. Theoretical Computer Science, 154(2):165-181, 1997. However, it is not clear that Fiorio and Gustedt implemented the Dillencourt/Samet/Tamminen algorithm. So, a comparison with the Wu/Otoo/Suzuki method may be worthwhile.

19. Look at the maps in http://www.washingtonpost.com/blogs/blogpost/post/40-maps-that-explain-the-world/2013/08/12/5775d214-0364-11e3-bfc5-406b928603b2_blog.html and show how the information that they capture could be presented using queries in the SAND Internet Browser. This means expressing the underlying data as spatial relations and then formulating queries using the primitives available in the SAND Internet Browser, possibly using the functionality of SAND-Tcl. See http://www.cs.umd.edu/~hjs/pubs/sandtcl.ref.pdf

20. Fix the mapping apps of Apple Maps, Google Android Maps, and iOS Google Maps to show the entire world at the full zoom-out level. In other words, enhance the API and app to show the whole world as in iOS 5. Otherwise, one cannot get the notion of browsing a newspaper to really work, as it is too slow and one cannot get the big picture.

21. Investigate the notion of implementing a point quadtree variant with a splay strategy. In other words, how do you splay the data points? It seems similar to the notion used by Samet in: H. Samet. Deletion in two-dimensional quad trees. Communications of the ACM, 23(12):703-710, December 1980. to delete, where we make the node closest to the deleted node the root of the new point quadtree. The splay tree of Tarjan and Sleator does this by rotation operations, while we do it with one operation. However, the result is the same in the sense that the replacement node becomes the new root node. The only difference is that we actually rebuild the underlying point quadtree with the new root, while in the one-dimensional splay tree, we move the deleted (or accessed) node to the root via rotation operations which preserve the structure of the binary search tree. The whole idea is analogous to a self-organizing file. This was motivated by listening to Eunhui Park's Ph.D. defense, although she was unaware of this relationship.
QUESTION: Can Samet's point quadtree deletion algorithm be expressed in terms of splay operations?
QUESTION: Samet observed that, empirically, the total path length decreased as a result of the deletion operation. Also, a logarithmic execution time was observed.
a. Can you prove the logarithmic deletion time?
b. What about an amortized logarithmic deletion time?
c. Can you prove that the total path length decreases in 2d? What about in 1d?

22. Eunhui Park thesis defense observations:
a. Look into the scapegoat tree, which is similar to the splay tree but involves no rotations (p. 13).
b. The treap is analogous to the priority search tree except that it uses a binary search tree on the values and a heap on priorities (assigned to the values at random, so that the structure does not inherit any non-randomness in the values, e.g., data that arrives in sorted order). In essence, the structure is nothing more than a Cartesian tree. The concept is analogous to the skip list except that no lists are used.
c. The splay quadtree is a misnomer. It is really a splay BBD-tree, but that does not sound so exciting. There cannot be a structure such as a splay quadtree (really a PR quadtree), because the structure's shape is independent of the order of insertion. The BBD-tree can be splayed because the aggregation process is decoupled from the partition process.
d. It is pointless to talk of having a splay red-black tree, as this is nothing but a 2-3 or 2-3-4 tree (order-3 or order-4 B-tree). The fact that these corresponding B-trees are balanced makes it pointless to ask how to splay them, as all this can achieve is rotation, and it is much more complex to specify in the framework of a red-black tree than in the framework of a B-tree.
e. It would be nice to have an example of a skip quadtree (p. 19) using the point data from Samet's data.
f. Explain Chan's method using Samet's data (p. 22).
g. Check the fair split tree of Kosaraju and Callahan (pp. 23-24) and see if it can be shown using Samet's data.
h. Give an example of spanners (p. 24) using Samet's data.
i. Explain the concept of doubling dimension (pp. 25-26).
j. Ask Jagan about homogeneous well-separation (p. 80).
k. Ask Jagan about the well-separated pair construction algorithms in the thesis.
l. What is the relation of net trees and r-nets to traditional metric space data structures like mb-trees, etc.?

NOT WELL-DEFINED PROJECTS:

1. Volunteer geographic information systems
a. Workshop at SIGSPATIAL'12
b. http://scholar.google.com/scholar_url?hl=en&q=http://onlinelibrary.wiley.com/doi/10.1111/tgis.12039/full&sa=X&scisig=AAGBfm05uni8BuefGPiF3h4MpWZAMeZaOQ&oi=scholaralrt

2. Sentence similarity with spatial data. E.g., http://scholar.google.com/scholar_url?hl=en&q=http://www.aclweb.org/anthology/N/N13/N13-1089.pdf&sa=X&scisig=AAGBfm0jdx1nhl4y7qi1i3SMaZ0Wf5xLRQ&oi=scholaralrt

3. C. Garth and K. Joy. Fast, Memory-Efficient Cell Location in Unstructured Grids for Visualization. IEEE TVCG, 16(6):1541-1550, 2010. Mentioned in http://www.iwr.uni-heidelberg.de/groups/CoVis/Teaching/VNT_SS13/vis2-4_effDatenstrukturen.pdf

4. A. Telea. Data Visualization. AK Peters Ltd., 2007. Mentioned in http://www.iwr.uni-heidelberg.de/groups/CoVis/Teaching/VNT_SS13/vis2-4_effDatenstrukturen.pdf

5. See the VTK Visualization Toolkit (Kitware).

6. Text summarization via Hidden Markov Models (Dianne O'Leary, SIGIR 2001). http://dl.acm.org/citation.cfm?id=384042 Mike Szaszy is a candidate for working on this topic.

7. Measure clustering quality. Mike Szaszy project.

8. Look into the following paper on toponym resolution and see what they do. Text-Driven Toponym Resolution using Indirect Supervision. M. Speriosu and J. Baldridge. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 1466-1476, Sofia, Bulgaria, August 4-9, 2013. http://scholar.google.com/scholar_url?hl=en&q=http://acl.eldoc.ub.rug.nl/mirror/P/P13/P13-1144.pdf&sa=X&scisig=AAGBfm1kAADgLLcMnUTnjIfQEU_hnq0aEw&oi=scholaralrt This paper encourages using the presence of clues like "lobster" to help resolve "Portland" in favor of "Maine" over "Oregon". This comes out of the linguistics literature. Examine it. This is also related to the use of concepts by Gianluca.
9. Fiona from Australia sent the following reference: Kjetil Vaage Øie (2012). Sensing the News: User Experiences when Reading Locative News. Future Internet 2012, 4, 161-178; doi:10.3390/fi4010161

10. Look into the European LEILAS "EuroStars" project at http://www.leilas.eu/ This website has information about work plans for the project. Try to find out what they actually accomplished. This means finding technical reports, etc. At the moment, all that the site contains are descriptions of tasks for a proposal. It was started in 2011, so there should be some results. One four-page paper in French, which is lacking in detail, can be found at http://scholar.google.com/scholar_url?hl=en&q=http://www.viseo.net/sites/default/files/viseo-group/article-leilas_representation_geospatiale_d_ensemble_de_documents_textuels-ecg2013.pdf&sa=X&scisig=AAGBfm3KqeEfEYf5cqS94Xnq53QuYLOHYw&oi=scholaralrt

11. The GeoXLS (ACMGIS'11) system is a search engine that allows users to submit a set of locations as a query object Q and to find documents containing locations similar to those in Q. Search results come from a collection of geotagged web documents, specifically a vast collection of spreadsheets obtained from the Web. The results are ranked according to their similarity to Q, where similarity is measured using the Hausdorff distance or the modified Hausdorff distance (a variant that reduces the effect of outliers); see the sketch following item 15 below. For example, given a point set query consisting of locations where a disease outbreak has been reported, one could search a database of previous outbreaks to find others with a similar geographic distribution. The search algorithm iteratively computes estimates of the distances between the query point set and the collection being searched, to prune the search space without computing the Hausdorff distance from the query set to every other point set in the collection.
a. The system currently searches point sets from a collection of spreadsheets downloaded from the Web. Identify other possible collections of point sets to search, and extend GeoXLS to search these point sets. One possible source of point sets is the Wikipedia.
b. Construct a GPU implementation of the incremental Hausdorff search algorithm utilized by GeoXLS.
c. The Hausdorff distance and modified Hausdorff distance are both computed by identifying the nearest neighbor of each point, but a single point can be used as the nearest neighbor of many points in the other point set (for example, a point set with 100 points in College Park and 1 point in London will be "similar" to a point set with 1 point in College Park and 100 points in London by these measures). Identify a distribution-sensitive similarity measure between a pair of point sets that limits this effect, and extend GeoXLS to support it.

12. Apply methods such as Hausdorff distance computation to the problem of hurricane detection and the tracking of similar ones. This is quite vague. It requires a person who is a domain expert in this area.

13. Compare NewsStand with Pulse, Flipboard, and other online news aggregation sites. What enhancements could NewsStand use?

14. Spatial keyword queries. See the SIGMOD'13 paper and references cited therein. Given a location and a set of keywords:
a. Find all objects, or a minimum set of objects, that cover a set of keywords.
b. Could minimize the area of the covering region.

15. Spatial join papers of the Ailamaki group at EPFL - SIGMOD'13, EDBT, SSDBM'13, etc.
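The (modified) Hausdorff distances used by GeoXLS in item 11 can be sketched directly from their definitions. The brute-force nearest-neighbor loops below are only for illustration, since the point of the incremental search algorithm is to avoid exactly that; the example coordinates are arbitrary.

    # Brute-force sketch of the Hausdorff and modified Hausdorff distances
    # between two point sets (cf. the GeoXLS project, item 11).  Real systems
    # replace the inner nearest-neighbor loops with an index.
    import math

    def _nn_dist(p, points):
        return min(math.dist(p, q) for q in points)

    def hausdorff(A, B):
        # max over both directions of the farthest nearest-neighbor distance
        return max(max(_nn_dist(a, B) for a in A),
                   max(_nn_dist(b, A) for b in B))

    def modified_hausdorff(A, B):
        # averaging reduces the influence of outlier points
        return max(sum(_nn_dist(a, B) for a in A) / len(A),
                   sum(_nn_dist(b, A) for b in B) / len(B))

    A = [(38.99, -76.94), (38.90, -77.04)]       # e.g., College Park, Washington
    B = [(38.98, -76.93), (51.51, -0.13)]        # a nearby point and London
    print(hausdorff(A, B), modified_hausdorff(A, B))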
16. SIGMOD'13 paper in Session 21 on tweets and finding all garage sales within 24 miles of a specific location. The idea is to constantly update the query results as events come up, rather than keeping them static. Currently, updates only happen in TwitterStand/NewsStand when the map moves a bit.

17. SIGMOD'13 paper on spatial keyword searches where one wants objects with certain properties from a minimal spatial area, etc. Not very novel, but one could do some extensions.

18. Look into the road network paper at SIGMOD'13 that improves on the Samet SIGMOD'08 paper.

19. Raymond Wong's paper at SIGMOD'13 about the bottleneck problem; compare with the Efrat and Itai paper.

---------------- NSF Key Value Store Proposal Tasks ----------------

RESEARCH TIMELINE:

Year 1:
A. Point Cloud
1. Derive a data partitioning scheme for a large point cloud.
2. Formulate a set of parallel algorithms for querying large point clouds.
3. Evaluate the efficiency of the proposed parallel algorithms by analyzing factors like the degree of parallelism, communication cost, and granularity.
B. Loose Quadtree
1. Implement a distributed variant of the loose quadtree for indexing moving objects.
2. Derive parallel query algorithms for the distributed loose quadtree.
3. Evaluate the effectiveness of the distributed loose quadtree by comparing it to grid-based moving object indexes.
C. Multiresolution Range Query
1. Formulate a distance-preserving hashing technique which enables range queries to be evaluated via simple lookup operations (see the sketch at the end of this section).
2. Compare and contrast the performance of the proposed technique to a classical solution based on hierarchical indexes.
3. Evaluate the effectiveness of the proposed hashing technique by applying it to real-world online map applications.

Year 2: Distance Oracle
1. Investigate the possibility of mapping distance oracles to a key-value store.
2. Devise a method to provide access to the key-value store representation of the distance oracle using one or two accesses, most of the time.
3. Demonstrate the applicability of our methods to Internet-scale applications by studying the throughput, average access times, and the quality of the answers.
4. Deploy the distance oracle in complicated query processing scenarios, especially those in location-based services and local searches.
5. Devise a set of publicly available distance oracles for the entire United States for different values of $\epsilon$.

Year 3: Distributed Graph Traversal
1. Derive a parallel graph traversal method for a message-passing-based distributed computing framework like Pregel.
2. Derive a parallel graph traversal method which does not require communication between tasks for the MapReduce framework.
3. Formulate parallel algorithms for different query problems (e.g., range, nearest neighbor, reverse nearest neighbor, and distance join queries) using the derived methods.
4. Compare and contrast the performance of the message-passing-based algorithms with algorithms that do not require communication between parallel tasks in different processing environments.

The main impact of this project is expected to be a spatial database management system which supports a wide range of spatial query types and is capable of "scaling out" to handle a large amount of spatial data. This will enable spatial applications such as online mapping, computer aided design, online gaming, and scientific simulations to store terabytes of spatial data and to execute queries such as shortest path and self-join on them. Thus the improvement will lie in the scale and range of the sizes of the data sets that we will be able to handle.
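The "multiresolution range query" task above (range queries answered by simple lookups against a key-value store) is often explained with space-filling-curve keys. The sketch below uses Morton (Z-order) codes as keys over a sorted list standing in for an ordered key-value store; the 16-bit grid and the single coarse Morton interval per query are simplifications, not the proposal's actual design, and a production version would split the query window into finer sub-intervals.

    # Sketch of answering a spatial range query with ordered-key scans only,
    # using Morton (Z-order) codes as keys.  The Morton interval covers the
    # query window but also extra cells, so false positives are filtered out.
    from bisect import bisect_left, bisect_right

    def interleave(x, y, bits=16):
        z = 0
        for i in range(bits):                      # interleave the bits of x and y
            z |= ((x >> i) & 1) << (2 * i)
            z |= ((y >> i) & 1) << (2 * i + 1)
        return z

    class MortonStore:
        def __init__(self, points):
            # "table" of (key, value) pairs sorted by Morton key
            self.rows = sorted((interleave(x, y), (x, y)) for x, y in points)
            self.keys = [k for k, _ in self.rows]

        def range_query(self, x1, y1, x2, y2):
            lo, hi = interleave(x1, y1), interleave(x2, y2)
            i = bisect_left(self.keys, lo)
            j = bisect_right(self.keys, hi)
            return [p for _, p in self.rows[i:j]
                    if x1 <= p[0] <= x2 and y1 <= p[1] <= y2]

    store = MortonStore([(3, 4), (10, 10), (200, 7), (12, 9)])
    print(store.range_query(0, 0, 15, 15))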
----------------- NSF GeoMultiMedia Proposal Tasks -----------------

RESEARCH TIMELINE:

Year 1: Geotagging and Clustering Improvements for News and Tweets
a. Identify features to be used in a machine learning based approach
b. Evaluate results of using machine learning vis-a-vis the current rule-based approach
c. Compare results with existing systems such as OpenCalais and Placemaker
d. Develop methods to identify reliable seeders for TwitterStand
e. Develop GPU-based document clustering based on an analogy to sparse matrix multiplication
The evaluation of the effectiveness of these components will be ongoing. In the first year, we will construct ground truths which consist of news articles that we deem to contain representative articles with the type of geographic locations that we wish to recognize and resolve correctly. In addition, we will make use of the error feedback mechanism to collect geographic locations that the system has consistently failed to recognize or has resolved incorrectly. These articles will form a set which will be tested from time to time to see if subsequent system improvements result in the errors disappearing.

Year 2: Image Similarity Using the Associated Text
a. Devise a system to retrieve newsworthy images and videos using a map query interface
b. Develop techniques for context-based image similarity search
c. Identify image transformation types that result in partial-duplicate images
d. Apply partial-duplicate image detection as a filtering mechanism for image queries
The evaluation will be conducted using a ground truth, which is a set of images that we have been collecting during the first and second year that are representative of the type of similarity which we would like the system to be able to detect. The evaluation will be ongoing with feedback for enhancement.

Year 3: Use of a Language Translation Service for News Text
a. Evaluate translation services (e.g., Google Translate) in terms of preservation of the cluster property
b. Evaluate the effectiveness of using translations of cluster centroids as cluster representatives for foreign language documents
c. Evaluate the use of non-native language documents to find similar clusters in the native language
d. Use Arabic language news web sites to evaluate translation quality
e. Identify reliable seeders for TwitterStand in other languages such as Arabic
The evaluation in this context is more of a qualitative nature in the sense that what we are doing is trying to incorporate the results of the translation service and determining if it makes sense to do so and, if so, in which components of our system. The focus on Arabic in the case of seeders for TwitterStand is primarily because we have access to a collaborator who has done much work in processing news sites in the Arabic language. We hope to draw on his insight in processing Arabic tweets. The identification of seeders will be ongoing, with the final evaluation done in the third year. Identification will use factors such as being the first to tweet on an important news topic, the number of tweets by others on the same topic, the number of different topics tweeted on, and the extent to which the seeder re-tweets (a negative factor). Of course, we also need to define the newsworthiness of the topic. This is done by making use of the URL that the tweet usually points at and whether there are existing news articles in its cluster.
The evaluation is done over an extended time period.

POSTDOC
The postdoc will be working on the geotagging module, which is a key to the success of the system. The postdoc will focus on the machine learning component in the first two years while also investigating the feasibility of the development of a language-independent geotagger. This will most likely be based on machine learning, as the rule-based techniques have a large language-dependent component. The postdoc will also work on investigating the extent to which the translation preserves clustering.

-------------------------- 2015 New Ideas --------------------------

1. Read the following IEEE PAMI June 2015 article on image similarity: IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 37, No. 6, June 2015. What Can Pictures Tell Us About Web Pages? Improving Document Search Using Images. By Sergio Rodriguez-Vaamonde, Lorenzo Torresani, and Andrew W. Fitzgibbon. Read the full article at http://www.computer.org/web/tpami?lf1=5391298952c991816059537a42786516
Abstract: Traditional Web search engines do not use the images in the HTML pages to find relevant documents for a given query. Instead, they typically operate by computing a measure of agreement between the keywords provided by the user and only the text portion of each page. In this paper we study whether the content of the pictures appearing in a Web page can be used to enrich the semantic description of an HTML document and consequently boost the performance of a keyword-based search engine. We present a Web-scalable system that exploits a pure text-based search engine to find an initial set of candidate documents for a given query. Then, the candidate set is reranked using visual information extracted from the images contained in the pages. The resulting system retains the computational efficiency of traditional text-based search engines with only a small additional storage cost needed to encode the visual information. We test our approach on one of the TREC Million Query Track benchmarks where we show that the exploitation of visual content yields improvement in accuracies for two distinct text-based search engines, including the system with the best reported performance on this benchmark. We further validate our approach by collecting document relevance judgments on our search results using Amazon Mechanical Turk. The results of this experiment confirm the improvement in accuracy produced by our image-based reranker over a pure text-based system.
This work is similar in spirit to the paper: J. Sankaranarayanan and H. Samet. Images in news. In Proceedings of the 20th International Conference on Pattern Recognition, pages 3240-3243, Istanbul, Turkey, August 2010. Categories: [spatio-textual search engine, Twitter] See http://www.cs.umd.edu/~hjs/pubs/news-icpr.pdf The goal here is to compare with NewsStand and to get a paper published at least in the ICPR'2016 conference or even CVPR, and a journal paper in the IEEE PAMI journal.

2. Using NewsStand and TwitterStand for disaster/emergency management. Look at what Jagan and Hong put together for the Nepal earthquake.

3. Modeling newspapers by their local lexicons.

4. User studies with researchers at the University of Salerno on the one-handed interface.

5. Google watch NewsStand implementation on Motorola (round face).

6. Google watch NewsStand implementation on Samsung (square face).
7. Apply the Samet/Lindenbaum analysis in the SIAM Journal to three-dimensional data to see if the PMR quadtree as well as the PM1 quadtree and MX octree have similar complexity bounds as the two-dimensional variants.

8. Analyze Samet's point quadtree deletion algorithm, which makes use of a splay-like operation, to see if one can prove that the total path length is decreased as a result of the operation. Also see if the complexity of this implementation of deletion is indeed O(log N), as observed by Samet in his 1980 CACM article.

9. Shangfu's image gallery for Instagram. Any other applications?

10. Read the article "Making the Mobile Web Faster" by K. Matsudaira, Communications of the ACM, Vol. 56, No. 3, March 2013, pages 56-61. http://dl.acm.org/citation.cfm?id=2428572 It presents various techniques to make web-based mobile applications faster. It would be interesting to try these techniques on the two versions of NewsStand:
1. App version
2. Mobile version
This could be attempted for all three of the iPhone, Android, and Windows Phone. The author suggests a number of methods. This could be done as a group project of several teams, with the winners being given the higher grades. One of the key research issues here is how to make map-based applications on the web faster, since much data is repeatedly transferred to the device. So, some issues are:
1. Minimizing connections and data across the network
2. Minimizing image requests (i.e., map tiles)
3. Leveraging local storage and caching
4. Pre-fetching and caching data (how different from item 3?)
5. Use non-blocking I/O (explain relevance here)
6. Avoiding redirects and minimizing DNS lookups
7. Use HTTP pipelining and SPDY (not sure about this!)
8. Sending the right data:
a. Use limit and offset to get results
b. Support partial response and partial update
c. Avoid or minimize cookies
d. Establish device profiles for APIs (e.g., different results for different form factors)

11. Read the article: A Scalable Content-Addressable Network, by S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker, SIGCOMM 2001. It describes the CAN method, which makes use of a quadtree representation. Look at the paper more closely and determine what quadtree research results (probably nearest neighbor computation and linear quadtrees) could be used here to improve the algorithms, etc. This could be a dead-end project in the sense that there is nothing substantive in it. Whoever takes this project needs to have an exit plan in terms of an alternative project should this one not pan out.

12. Read the article: Ptolemaic indexing. M. L. Hetland. Journal of Computational Geometry, 2015.
Abstract: This paper discusses a new family of bounds for use in similarity search, related to those used in metric indexing, but based on Ptolemy's inequality, rather than the metric axioms. Ptolemy's inequality holds for the well-known Euclidean distance, but is also ...
http://scholar.google.com/scholar_url?url=http://www.jocg.org/index.php/jocg/article/download/42/85&hl=en&sa=X&scisig=AAGBfm0vUOB6jdwnrhEsEgRKEHICSdhdUQ&nossl=1&oi=scholaralrt
Definition of the Ptolemaic property: even though most distance-based indexing has centered on the metric axioms, this paper focuses on another property, known as Ptolemy's inequality, which states that xv \cdot yu \leq xy \cdot uv + xu \cdot yv (1) for all objects x, y, u, v. Premetrics satisfying this inequality are called Ptolemaic [2].
In terms of the Euclidean plane: for any quadrilateral, the sum of the pairwise products of opposing sides is greater than or equal to the product of the diagonals (see Fig. 1). It is a well-known fact that the Euclidean distance is Ptolemaic [15]. There are many proofs for this, some quite involved, but it can be shown quite simply using the idea of inversion [16]. Definition of a premetric: a dissimilarity function. The project is to see if the incremental nearest neighbor method of Hjaltason and Samet works in this context. The paper mentions these algorithms, but it is not clear if they work.

13. Read the article: Bodo Manthey. Smoothed Analysis of Local Search Algorithms. Keynote talk at WADS'15, Victoria, British Columbia, Canada, August 2015. http://wwwhome.math.utwente.nl/~mantheyb/misc/WADS2015_Manthey_LocalSearch.pdf Look into it and find a connection to the iNN algorithm of Hjaltason and Samet, if any.

14. Read the article: JiTTree: A Just-in-Time Compiled Sparse GPU Volume Data Structure. M. Labschutz, S. Bruckner, M. E. Groller, and M. Hadwiger. IEEE TVCG, August 15, 2015, online posting.
Abstract: Sparse volume data structures enable the efficient representation of large but sparse volumes in GPU memory for computation and visualization. However, the choice of a specific data structure for a given data set depends on several factors, such as the ...
May be of interest for volume data structures for GPU usage. Need to look at it more carefully. Compare with the PK-tree or adapt the PK-tree to this environment. A possible CMSC 725 project.

MOST RECENTLY ADDED TOPICS on September 1, 2015:

1. Trendspottr analog - one for news and one for tweets.

2. Expand on the work of Rongjian Lan for the spatiotemporal NewsStand. The first item of business is to fix the Tour. Next, try to include news articles over a longer period of time and use his visualizer in the form of a heat map. This would show the migration of articles about a specific cluster over time and space. Rongjian Lan's paper at the 2014 SIGSPATIAL Workshop on HEALTHGIS is a starting point.

3. Research on caption generation (see the 2015 MSR Faculty Summit).

4. Compare the news aggregators mentioned in the September 2015 issue of CACM with NewsStand and enhance NewsStand to contain some of their features.

5. Implement NewsStand (adapt it) to run on the Google Watch. There are two variants, the Samsung and the Motorola. We could have a competition where we have teams of two students.

6. Enhance WeiboStand to have the same UI as TwitterStand and the same clustering. Also make use of Microsoft Translator. It appeared in the 2014 SIGSPATIAL Workshop on Location-Based Social Networks and was a project in Fall 2013 by Cheng Fu.

7. Implement TRUE semantic caching in NewsStand and differentiate it from Chung. Need REAL performance data. This paper appeared in the 2013 SIGSPATIAL Conference.

8. Determine the sun exposure for a particular house via data from sources such as Google Street View, Zillow, and others. Also for a simple tent with four bags to hold it, where you want to see the shaded area (i.e., the position of the sun) as time goes on.

9. Modify WAZE and navigation programs to understand elevations where the targets have different heights, like differentiating between the arrival and departure levels at Ben Gurion airport in Israel for the Eldan car return, as well as levels in a parking structure, etc. In other words, we should modify the representation of the road network to include such features. For example, look at SILC and the distance oracle.

10. "Last mile" navigation systems.
Modify one of them, such as WAZE, to take into account parking in determining a path to your destination. In other words, the destination includes more than one point. This route could be optimized by a minimum price for the time period and the distance from the true destination. Thus we are adding temporal factors to the shortest path computation:
a. Minimize parking cost and distance from the true destination in terms of walking, taking a shuttle bus, etc.
b. Give the duration of the stay as a parameter in order to determine the optimal path.
This could be used for determining a route to the airport, which can be further differentiated by whether you will use valet parking, a shuttle bus, etc. What we see here is a much more detailed shortest path and navigation plan. Currently, programs like Google Directions take into account factors such as navigation mode (e.g., walking, driving, and public transit) but not the "last mile", which is where people really get lost. Our oracle model shows this to be true, where we see that the shortest paths between many sources and destinations are very similar.

11. Look at Bruno Martins' references for geotagging and apply them to NewsStand and evaluate.

12. Enhance the cleanup paper, which is about a system that cleans up RSS feed web pages and extracts images, videos, etc., especially in terms of excluding ads. A paper was written about the NewsStand cleanup system by Ben Teitler, but reviewers asked for more work. Shangfu Peng knows about its current state. The same could also be applied to cleaning up videos. Larry Davis's former Turkish student Ismail Haritaoglu has a company that is interested in such a technology, as they serve videos to major sports networks.

13. Transcription of videos in NewsStand so that we can cluster their contents better.

14. ORCAM glasses for Alzheimer's patients, so that you only keep track of people in your photo album and do not record what you see. Instead, if you see someone, you check your own personal photo library to see if they have been seen before and who they are. This saves the embarrassment of not recognizing your spouse.

15. Windowseat

16. WAZE social network for finding gas stations and toilets.

PROJECTS SUGGESTED BY SHANGFU PENG ON September 1, 2015:

We have built the following systems:
1. STEWARD: http://steward.umiacs.umd.edu/
2. NewsStand: http://newsstand.umiacs.umd.edu/
3. TwitterStand (new): http://twitterstand.umiacs.umd.edu/News/
4. PhotoStand: http://newsstand.umiacs.umd.edu/photostand-jack/
5. Geollery (Instagram Images): http://sametphp.umiacs.umd.edu/geollery/
6. ASDO: http://sametnginx.umiacs.umd.edu/

Possible Projects:

1. Generalized Quadtree for a Database System. This is similar to GiST in PostgreSQL. You need to develop the general methods and implement them for a generalized quadtree data structure. Then users can use your data structure to easily specify numerous quadtree variants such as the MX quadtree, PM quadtree, PM2, PM3, edge quadtree, PMR quadtree, loose quadtree, etc. Related work is the SP-GiST work of Aref et al. If you do this project, yours needs to be much better!
W.~G. Aref and I.~F. Ilyas. An extensible index for spatial databases. In {\em Proceedings of the 13th International Conference on Scientific and Statistical Database Management}, pages 49--58, Fairfax, VA, July 2001.
W.~G. Aref and I.~F. Ilyas. {SP-GiST}: an extensible database index for supporting space partitioning trees. {\em Journal of Intelligent Information Systems}, 17(2--3):215--240, December 2001.
Also see the paper titled "Generalized Search Trees for Database Systems": J.~M. Hellerstein, J.~F. Naughton, and A.~Pfeffer. Generalized search trees for database systems. In {\em Proceedings of the 21st International Conference on Very Large Data Bases ({VLDB})}, U.~Dayal, P.~M.~D. Gray, and S.~Nishio, eds., pages 562--573, Zurich, Switzerland, September 1995. \url{http://gist.cs.berkeley.edu/}.
Some of the key GiST methods are: Consistent, Union, Compress, Decompress, Penalty, PickSplit, Search, Insert, Split, Delete, etc. GiST can implement: B+-tree, R-tree, hB-tree, RD-tree.
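Project 1 can be pictured as a GiST-style extension interface specialized to quadtrees: the framework owns the recursive 4-way decomposition, and each quadtree variant supplies a small set of methods (reduced to two here for brevity). The method names, the PR-quadtree example, and the depth cap are illustrative assumptions, not SP-GiST's actual API.

    # Sketch of a GiST-like extension interface for quadtree variants
    # (project 1).  The framework owns the decomposition; a variant only
    # supplies the split test, plus a Consistent-style overlap test that
    # drives search.
    class QuadtreeVariant:
        def must_split(self, points, depth):   # roughly GiST's PickSplit/Penalty role
            raise NotImplementedError
        def consistent(self, box, query_box):  # roughly GiST's Consistent role
            bx1, by1, bx2, by2 = box
            qx1, qy1, qx2, qy2 = query_box
            return bx1 <= qx2 and qx1 <= bx2 and by1 <= qy2 and qy1 <= by2

    class PRQuadtree(QuadtreeVariant):
        def __init__(self, bucket_size=1):
            self.bucket_size = bucket_size
        def must_split(self, points, depth):
            return len(points) > self.bucket_size and depth < 16

    def build(variant, points, box, depth=0):
        """Generic builder: recurses while the variant says a node must split."""
        if not variant.must_split(points, depth):
            return {"box": box, "points": points}
        x1, y1, x2, y2 = box
        xm, ym = (x1 + x2) / 2, (y1 + y2) / 2
        quads = [(x1, y1, xm, ym), (xm, y1, x2, ym), (x1, ym, xm, y2), (xm, ym, x2, y2)]
        return {"box": box,
                "children": [build(variant,
                                   [p for p in points
                                    if q[0] <= p[0] < q[2] and q[1] <= p[1] < q[3]],
                                   q, depth + 1)
                             for q in quads]}

    def range_search(variant, node, query_box):
        if "points" in node:
            return [p for p in node["points"]
                    if query_box[0] <= p[0] <= query_box[2]
                    and query_box[1] <= p[1] <= query_box[3]]
        return [p for child in node["children"]
                if variant.consistent(child["box"], query_box)
                for p in range_search(variant, child, query_box)]

    tree = build(PRQuadtree(), [(1, 1), (2, 9), (8, 8)], (0, 0, 16, 16))
    print(range_search(PRQuadtree(), tree, (0, 0, 4, 4)))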
2. Dynamic Gazetteer Built Up from Social Media and News. Motivation: this is the first step of "spatial intelligence". GeoNames.org is currently the largest geographical database that spans all countries and contains over eight million place names. However, it is a static gazetteer. Upon encountering a toponym, both GeoNames and NewsStand generate all the possible interpretations for it and order them by population. Instead, we want to vary the order of the interpretations. One possibility is by using the number of their recent occurrences. Another could be the identity of the user (person or news source) who is querying the gazetteer. Still another could involve the time period in which the query is being posed. Another ordering could be based on the average understanding of people for the toponym. If we think of the gazetteer as a well-read person, then the probabilities are equivalent to his spatial knowledge background. All previous gazetteers assume that the person's knowledge background is static. However, in fact, the spatial knowledge background is really dynamic, as it varies over time. Moreover, the spatial knowledge background should depend on the person with whom the well-read person is conversing. Here "dynamic" has several meanings:
A. At a certain time or period, some words may imply a certain location. For example, during the period that Brazil is hosting the FIFA World Cup, the word "soccer" may imply Brazil. Before the next soccer game for country A, the occurrence of the word pair "A soccer" may imply the city in Brazil where the next match for country A is being held. Similarly, the names of some famous people may also refer to a city or a location when those people visit that city. In addition, if a big event happened in a small town, e.g., Paris, Texas, then when people say "Paris", it refers to Paris, Texas with a higher probability than to Paris, France. Thus "dynamic" means that the interpretations of each toponym may vary over time, so that some new interpretations come into play while others vanish. In addition, the probabilities of the interpretations may change.
B. The probabilities may also vary according to the identity of the users. For example, when a person who lives in or travels to Paris, France says "Paris", it always refers to Paris, France rather than Paris, Texas, and vice versa.
NewsStand, TwitterStand, and Instagram can be used to develop the dynamic gazetteer. In these systems, although there are many articles for which no location information is known, we have GPS information associated with some of the articles and/or tweets, so that we know with 100% certainty the related location interpretation. The news articles, tweets, and photos for which we have the 100% certainty information will be used to build the dynamic gazetteer on top of the static gazetteer. A sketch of such a re-ranking appears below.
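One way to make the "dynamic" ordering in project 2 concrete is a gazetteer whose ranking blends static population with an exponentially decayed count of recent confirmed mentions (e.g., from GPS-tagged posts). The half-life, the weight on recent mentions, and the population figures below are arbitrary illustrative choices, not part of the project description.

    # Sketch of a dynamic gazetteer (project 2): interpretations of a toponym
    # are ordered by static population plus an exponentially decayed count of
    # recent confirmed (e.g., GPS-tagged) mentions.  The half-life and the
    # weight on recent mentions are arbitrary illustrative choices.
    import time
    from collections import defaultdict

    HALF_LIFE = 7 * 24 * 3600          # one week, in seconds

    class DynamicGazetteer:
        def __init__(self, static_entries):
            # static_entries: {toponym: [(interpretation, population), ...]}
            self.static = static_entries
            self.mentions = defaultdict(list)   # (toponym, interpretation) -> [timestamps]

        def record_mention(self, toponym, interpretation, ts=None):
            self.mentions[(toponym, interpretation)].append(ts or time.time())

        def _recency(self, toponym, interpretation, now):
            return sum(0.5 ** ((now - ts) / HALF_LIFE)
                       for ts in self.mentions[(toponym, interpretation)])

        def interpretations(self, toponym, now=None):
            now = now or time.time()
            ranked = [(pop + 1_000_000 * self._recency(toponym, name, now), name)
                      for name, pop in self.static.get(toponym, [])]
            return [name for _, name in sorted(ranked, reverse=True)]

    gaz = DynamicGazetteer({"paris": [("Paris, France", 2_200_000),
                                      ("Paris, Texas", 25_000)]})
    for _ in range(5):                  # a burst of confirmed mentions of Paris, Texas
        gaz.record_mention("paris", "Paris, Texas")
    print(gaz.interpretations("paris"))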
3. a. Discovering regions with different functions using social media postings.
b. Discovering users that are spatially similar using social media postings.
In the NLP community, people focus on (word, document) pairs to discover topic models over a document set. In the GIS community, we usually partition the spatial domain, which is indexed using, for example, a quadtree, an R-tree, or a grid-based partition. On the other hand, the daily activities of people as chronicled by their social media posts are usually associated with geotags. Thus, by analogy, for (a) we have pairs such as (social media activity, spatial region) and for (b) we have pairs such as (spatial region, user). We can either treat every spatial region as a "document", or treat every user as a "document". In this case, we can borrow some NLP methods, especially topic models, to get some interesting results. The social media data can come from Twitter (TwitterStand), Instagram (Geollery), or any other sources. For (a), you need to find the regions' functions in terms of their associated social media postings. For (b), you need to find similar users based on similar spatial regions, which are based on their social media postings. A related paper from Microsoft Asia: Discovering Regions of Different Functions in a City Using Human Mobility and POIs. This paper explores the pair (human mobility and POIs, spatial region) to discover regions with different functions, e.g., residence, governmental and public organizations, business, shopping mall, etc. This could be helped via the aid of crowdsourcing methods.

4. Embedded web map API. Provide an embedded web map API that maps text content provided by users onto a map, so that users can easily create a map web page showing their own content. It is like http://www3.clustrmaps.com/ or https://www.revolvermaps.com/, which record the visitors to a website. In this project, you need to provide the embedding code to users, who can then use your code to show their own content on a map. For example, WikiCFP (a wiki for calls for papers, http://www.wikicfp.com/cfp/) may want to show the locations of conferences with recent calls for papers on a map (like Google Maps), but it is a bit of a waste of time for them to build such a page from scratch. In this project, you are building a framework that helps Internet-scale numbers of users automatically generate an attractive web page showing the content that they specify. Note that your framework should understand spatial words, i.e., detect place names and map them to latitude/longitude locations.

NOT USED:

1. Investigate the influence of using the road network distance metric versus the Euclidean distance metric on spatial ranking results. In particular, for what kinds of spatial queries is the road network distance metric appropriate, and for what kinds is the Euclidean distance metric appropriate? You should provide enough theoretical and experimental evidence. Background: there are a lot of location-based services today. Many of them just use the Euclidean distance metric as the ranking function for spatial queries, e.g., finding the several nearest restaurants in Yelp. However, in this case, users really prefer to know the driving distance (road network distance). If you could compute the road network distance as quickly as the Euclidean distance, what would change?
Our ASDO demo can quickly return the shortest network distance between two points in the USA region, e.g., 60,000 queries per second, and we also provide an ASDO web API to do so. For example, the web API can be invoked with curl:
curl http://sametnginx.umiacs.umd.edu/usaapi?lat1=43.45&lon1=-101.99&lat2=40.15&lon2=-94.57
You can optionally use this API to get the road network distance.
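A client-side call to the ASDO web API could look like the following sketch. The endpoint and the lat/lon parameter names come from the curl example above; the shape of the returned value is unknown here, so the response is returned raw rather than parsed.

    # Sketch of calling the ASDO distance API from Python.  The URL and the
    # lat/lon parameter names come from the curl example above; how the
    # response encodes the distance is an assumption, so it is returned raw.
    import requests

    def asdo_distance(lat1, lon1, lat2, lon2):
        resp = requests.get("http://sametnginx.umiacs.umd.edu/usaapi",
                            params={"lat1": lat1, "lon1": lon1,
                                    "lat2": lat2, "lon2": lon2},
                            timeout=10)
        resp.raise_for_status()
        return resp.text          # parse as needed once the response format is known

    print(asdo_distance(43.45, -101.99, 40.15, -94.57))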