CMSC 725 Fall 2015 project descriptions. If more information is needed, please ask the Teaching Assistant, Shangfu Peng. Some projects are not enough by themselves and thus would logically be combined with related projects. All projects are individual projects. There are NO group projects, with the possible exception of the Google Watch NewsStand project. The goal of each project is to be able to get a research paper out of it. This is not a requirement, but it has happened a number of times in the past and is something to consider when choosing a project.

NEWSSTAND-RELATED PROJECTS:

We have developed a number of systems for spatio-textual search.
1. STEWARD: http://steward.umiacs.umd.edu/
2. NewsStand: http://newsstand.umiacs.umd.edu/news/light
3. TwitterStand: http://twitterstand.umiacs.umd.edu/news

The key idea behind all of these systems is the identification of words that correspond to geographic locations (e.g., does "Jordan" refer to the name of a person like "Michael Jordan" or does it refer to the country or river?). If it refers to a geographic location, then we need to identify the appropriate instance (e.g., is "London" in England, Canada, etc.?). This work involves some linguistics and is generally known as "geotagging". Once the words have been correctly identified, we plot our results on a map. In other words, we associate documents with the locations that are contained within them, so that documents can be retrieved by location. More importantly, the query location is usually specified graphically by pointing at it using a map query interface. The advantage of using such an interface is that we need not know the exact name of the location. We just need to point at a nearby location. In other words, we are able to make use of synonyms in our queries, where the synonyms are restricted to being "spatial" (i.e., names of geographic locations). This is an example of "approximate search".

STEWARD works with a collection of documents and attempts to identify all instances of geographic locations in each document. It enables viewing a document with a map as well as finding all documents that contain a particular keyword. In addition, an attempt is made to determine a geographic focus for a document, which is the most important location in the document from the point of view of being the subject of the document. Currently, the system works for a collection of documents from the following sources:
A. HUDUSER: A collection of documents.
B. PubMed: Abstracts of published biomedical articles.
C. ProMED-mail: Mailing list postings about infectious disease outbreaks.
D. News: Collection of news (old)

NewsStand works with a set of 10,000+ HTML RSS news feeds. It cleans up the HTML so that the articles, images, and videos can be extracted. It then uses the TF-IDF representation of the articles to cluster them using an online clustering algorithm. A number of factors go into the clustering. In addition, the topics of the articles are determined (one of a small set). Images and videos are extracted and collected for each cluster, and the articles are geotagged. The main version of NewsStand is web-based. However, a separate implementation of NewsStand also exists on smartphones such as the iPhone and Android.
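The clustering step mentioned above can be pictured with a minimal sketch of online (leader-follower) clustering over TF-IDF vectors with cosine similarity. This is only an illustration, not NewsStand's implementation: the similarity threshold, the whitespace tokenizer, and the centroid update are all assumptions.

    # Minimal sketch of online (leader-follower) clustering of articles by
    # cosine similarity over TF-IDF vectors.  The threshold and the whitespace
    # tokenizer are illustrative assumptions, not NewsStand's actual choices.
    import math
    from collections import Counter, defaultdict

    class OnlineClusterer:
        def __init__(self, threshold=0.5):
            self.threshold = threshold
            self.doc_freq = defaultdict(int)   # document frequency per term
            self.num_docs = 0
            self.clusters = []                 # each cluster keeps a centroid (term -> weight)

        def _tfidf(self, text):
            tf = Counter(text.lower().split())
            self.num_docs += 1
            for term in tf:
                self.doc_freq[term] += 1
            return {t: c * math.log(self.num_docs / self.doc_freq[t] + 1.0)
                    for t, c in tf.items()}

        @staticmethod
        def _cosine(a, b):
            dot = sum(w * b.get(t, 0.0) for t, w in a.items())
            na = math.sqrt(sum(w * w for w in a.values()))
            nb = math.sqrt(sum(w * w for w in b.values()))
            return dot / (na * nb) if na and nb else 0.0

        def add(self, text):
            vec = self._tfidf(text)
            best, best_sim = None, 0.0
            for cluster in self.clusters:
                sim = self._cosine(vec, cluster["centroid"])
                if sim > best_sim:
                    best, best_sim = cluster, sim
            if best is not None and best_sim >= self.threshold:
                best["members"].append(text)      # join the most similar cluster
                for t, w in vec.items():          # fold the article into the centroid
                    best["centroid"][t] = best["centroid"].get(t, 0.0) + w
            else:
                self.clusters.append({"centroid": dict(vec), "members": [text]})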
TwitterStand works with tweets. The key idea is to capture tweets that correspond to late-breaking news. The result is analogous to a distributed news wire service. The difference is that the identities of the contributors/reporters are not known in advance and there may be many of them. The tweets are not sent according to a schedule. The tweets occur as news is happening and are noisy, while usually arriving at a high throughput rate. Some of the issues include removing the noise, determining tweet clusters of interest bearing in mind that the methods must be online (i.e., clusters whose elements are constantly changing over time, thereby reflecting the news cycle, which is one of increasing volume as the topic becomes known, followed by a more steady decline as interest naturally fades), and determining the relevant location associated with the contents of the tweets, rather than the locations from where the tweets are sent. This is quite challenging and makes heavy use of the techniques developed for the NewsStand system.

% HJS: Shangfu - please be precise below as to what we have access to.
% This is in terms of the Gardenhose, Firehose, and what we get
% from Twitter.
We have access to Gardenhose (define), Firehose (define).

Possible Projects:

1. Introduce news summaries into NewsStand. This is similar to a system by a 17-year-old that was recently sold to Yahoo. In some sense, the pyramid style of news writing is such that one can cut off the article at any point and have a summary. A question is the quality of the summaries provided by the commercial system over just using the result of the pyramid style.

2. Incorporate a broader search capability into NewsStand. Currently, the search command only deals with words that appear in the headline. In other words, only they are indexed. The problem is that indexing all terms will require much work and storage. This is what Google does. Devise some form of lazy evaluation for indexing so that, perhaps, indexes are only built for articles that have been fetched by queries. This could be broadened to index all articles that appear in a cluster one of whose element articles has been fetched by a query (see the inverted index sketch below).

3. Adapt STEWARD to deal with the Wikipedia. In this case, the result is a tool for looking at Wikipedia articles and providing a way to access the Wikipedia by location. Note that you will be identifying the locations using techniques similar to what is currently being used in STEWARD and NewsStand. Note that you are not using the "geotags" often provided in conjunction with the Wikipedia articles. However, you could do an evaluation of your own geotagging against that of the preassigned geotags. This is not simple in the sense that you need to be sure that you are doing something that does not already exist. Thus you will need to spend some time investigating what is currently available.

4. Adapt STEWARD to deal with documents of the World Bank. For this you need to have a contact at this organization. There may be a chance of doing this due to contacts with Ihab Ilyas. This is a long shot project.

5. Adapt STEWARD to deal with documents of a government agency. For this you need to have a contact at the agency.

6. Adapt STEWARD to deal with a set of documents on disease spread monitoring. This is already in existence but could be beefed up. Rongjian Lan did such a project with ProMED-mail in 2011 and ended up with a paper at the HEALTHGIS'12 workshop. See http://www.cs.umd.edu/~hjs/pubs/healthgis12.pdf In particular, something like this could be done by reaching out to Rita Colwell. Again, this is a long shot project which will require some initiative.
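For project 2 above, a lazy inverted index might look like the following sketch. The trigger (indexing an article's body only the first time its cluster is hit by a headline query) and the data layout are assumptions chosen for illustration.

    # Sketch of lazy full-text indexing (project 2): headlines are indexed
    # eagerly, article bodies only when their cluster is first touched by a
    # query.  The layout and trigger policy are illustrative assumptions.
    from collections import defaultdict

    class LazyIndex:
        def __init__(self):
            self.headline_index = defaultdict(set)   # term -> article ids
            self.body_index = defaultdict(set)       # built on demand
            self.articles = {}                        # id -> (headline, body, cluster_id)
            self.indexed_clusters = set()

        def add(self, art_id, headline, body, cluster_id):
            self.articles[art_id] = (headline, body, cluster_id)
            for term in headline.lower().split():
                self.headline_index[term].add(art_id)

        def _index_cluster(self, cluster_id):
            if cluster_id in self.indexed_clusters:
                return
            self.indexed_clusters.add(cluster_id)
            for art_id, (_, body, cid) in self.articles.items():
                if cid == cluster_id:
                    for term in body.lower().split():
                        self.body_index[term].add(art_id)

        def search(self, term):
            term = term.lower()
            hits = set(self.headline_index.get(term, set()))
            # lazily index the full text of every cluster that a headline hit touches
            for art_id in list(hits):
                self._index_cluster(self.articles[art_id][2])
            return hits | self.body_index.get(term, set())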
7. Adapt STEWARD or NewsStand to monitor mentions of brands in the news. In the case of NewsStand this could be done by adding a "Brand" layer, much in the same spirit as currently done for the disease and people layers. One possible source for names of brands is the website brandtags.com, which lists not only brands but the most common words associated with them. After poking around a little online, one of the most comprehensive lists is found at http://www.kelkoo.co.uk/bd-brands. It's hard to find an authoritative source. Since brands are somewhat fluid, being created and destroyed all the time, it would be tough to maintain a current, up-to-date list. This project came at the suggestion of Prof. William Rand of the Smith School of Business. This was done in the Fall of 2013 by Ahmed, who got a demo paper about it accepted at the 2015 SIGSPATIAL Conference.

8. Develop methods to determine the "geographic focus" of a document, individual news articles, and also clusters of news articles, and compare them. In particular, you will need to conduct an evaluation. The efficacy of your method may differ depending on the underlying input.

9. NewsStand currently uses a single server. The architecture of NewsStand is described in the following paper: http://www.cs.umd.edu/~hjs/pubs/newsstand-arch.pdf A broader description is given in: B. E. Teitler, M. D. Lieberman, D. Panozzo, J. Sankaranarayanan, H. Samet, and J. Sperling. NewsStand: A new view on news. In GIS'08: Proceedings of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 144-153, Irvine, CA, November 2008. http://www.cs.umd.edu/~hjs/pubs/newsstand-acmgis2008.pdf and in the October 2014 issue of Communications of the ACM. Investigate quality of service issues for NewsStand. In other words, investigate how many users the current implementation of NewsStand can support and the query mix that would yield a meaningful performance study. Determine if the quality of the service can be improved by making use of a cloud-like architecture such as is available on Amazon, and develop a prototype.

10. Investigate what aspects of the NewsStand pipeline could be sped up via use of GPUs. For example, look at the document ingestion system (i.e., cleanup module) as described in a paper that we wrote, found at /nfshomes/codepoet/work/papers/cleaner/main.pdf Alternatively, can we use several servers on the cloud to perform this task? The idea is to speed up the document ingestion component, possibly by distributing it so that a number of servers are cleaning up the documents in parallel, or possibly by using a GPU to do the task (see the worker-pool sketch below).

11. Investigate whether latency (distance from UMD) has an effect on the performance of NewsStand (both app and web versions). Investigate the use of a mirror site. Also investigate how quickly data received by the NewsStand clients can be rendered in NewsStand's interface. The speed of rendering greatly affects NewsStand's perceived interactivity. Determine if the rendering time can be shortened somehow.

12. Investigate the use of GPUs for clustering the news articles. Some work has been done on this already by Ben Teitler. The goal is to obtain speedups over a sequential approach. The speedup should be more than a factor of 2, which is what was obtained by Ben Teitler. DONE!
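The parallelization idea in project 10 above (several servers or workers cleaning up fetched pages at once) can be pictured with a worker pool. The clean_html function and the pool size below are stand-ins, not the actual NewsStand cleanup module.

    # Sketch of distributing the document ingestion/cleanup step (project 10)
    # across a pool of workers.  clean_html is a placeholder for the real
    # cleanup module; the pool size is an arbitrary choice.
    from multiprocessing import Pool
    import re

    def clean_html(page):
        # stand-in cleanup: strip tags and collapse whitespace
        text = re.sub(r"<[^>]+>", " ", page)
        return re.sub(r"\s+", " ", text).strip()

    def clean_feed(pages, workers=8):
        with Pool(workers) as pool:
            return pool.map(clean_html, pages)

    if __name__ == "__main__":
        print(clean_feed(["<html><body><p>Breaking news ...</p></body></html>"]))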
13. Investigate a distributed clustering algorithm for NewsStand. The idea is to distribute the computation of the similarity function (cosine) with every feature vector and figure out which ones can be pruned. This is an alternative solution to the one using GPUs in 12, although it may also make use of GPUs. Note that GPUs have a SIMD (single instruction, multiple data) architecture, with many cores all executing the same program, but with different data input. All the processors are able to access a single large shared memory space, which is relatively slow, as well as smaller cache memories on individual processors, which are faster. GPUs are optimized for arithmetic-heavy tasks which are frequently encountered in graphics processing, which makes them suitable for heavy number crunching. On the other hand, distributed architectures (e.g., MapReduce) involve many separate computers, each with dedicated and possibly non-uniform hardware, which communicate over a network. Both the GPU and the distributed architectures are readily programmable using general purpose programming languages.

14. Suppose that each user of NewsStand has access to a device with computing power as is planned for the NVIDIA Shield tablet device, which has a local GPU and which we own and you can borrow. Can you offload some of the functionality of NewsStand onto these local clients? For example, different RSS news feeds can be ingested by these local clients. Devise a scheme where, in essence, NewsStand is being outsourced.

15. NewsStand's geotagging module makes a number of errors. In particular, we have observed repeated errors for certain locations. Your job is to run NewsStand extensively and identify entities that are wrongfully identified as locations (e.g., "George" in South Africa and other first names that are often wrongfully interpreted as geographic locations) or are associated with the wrong geographic location (e.g., "Georgetown" being placed in Guyana instead of being associated with "Georgetown University" in Washington, D.C.). Some example locations are provided below. This is not exhaustive but gives you some idea of the scope of the problem: Georgetown, Northern District, Batman, Man, Barcelona, Santa Cruz, Valencia, Alexandria, Athens, Columbus, Columbia, Paris, London, Berlin, Moscow, Salem, Springfield, Dover, Newark, Augusta, Portland, St. Louis, Dublin, Hollywood, George, San Jose, Norwalk, Edgewood, Portland, La Paz, Manchester, Washington, San Miguel, Santiago, Cairo, Damascus, Arlington, Richmond, Palestine, York, Winchester, Chester, Bristol, Rome, Vienna, Brentwood, Oakland, Riverside, Algiers, Geneva, Monaco, Frankfurt, Lexington, Concord, Dover. One possible approach is to have a specialist for the various locations. For example, in the case of "Georgetown" the idea is to be able to recognize all possible usages of it when it is the name of the university in Washington, DC. In this case, the references are often to the athletic teams. It would be interesting to see the extent to which the general geotagging module in NewsStand can be improved, if at all, using these specialized packages (see the sketch following item 16 below). Several students can be assigned this project, with a competition at the end to see who does best. Shangfu Peng has looked at this problem but found no definitive solution. Zachary Williams is also currently looking at it.

16. Investigate the utility of time-dependent geotagging, where the resolution of a possible toponym depends on its current usage. For example, "George" may appear a lot in the news at a certain time and thus an interpretation as a location may be less probable.
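One way to picture the "specialist" idea in project 15 is a small rule-based disambiguator that looks for context words before accepting a default gazetteer interpretation. The cue words, coordinates, and interpretations listed here are illustrative assumptions, not output of NewsStand's geotagger.

    # Sketch of a per-toponym "specialist" (project 15): context cues decide
    # whether "Georgetown" means the university in Washington, DC rather than
    # the default gazetteer interpretation (Georgetown, Guyana).  The cue
    # words and coordinates are illustrative assumptions.
    SPECIALISTS = {
        "georgetown": [
            # (cue words that must appear nearby, chosen interpretation)
            ({"hoyas", "university", "basketball", "campus"},
             ("Georgetown University, Washington, DC", 38.9076, -77.0723)),
            ({"guyana", "demerara"},
             ("Georgetown, Guyana", 6.8013, -58.1551)),
        ],
    }

    def resolve(toponym, context_words, default):
        """Return a (name, lat, lon) interpretation for toponym given its context."""
        for cues, interpretation in SPECIALISTS.get(toponym.lower(), []):
            if cues & context_words:
                return interpretation
        return default

    context = {"the", "hoyas", "won", "their", "opener"}
    print(resolve("Georgetown", context,
                  ("Georgetown, Guyana", 6.8013, -58.1551)))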
17. NewsStand has a topic detection module, which identifies the general topic of an article (e.g., business, sports). Evaluate its quality and compare with other work in the literature. Try to develop a better method. Samet Ayhan tried this topic in 2011 but the results were inconclusive.

18. NewsStand currently only works for articles in English. This means that only news sources in English are handled. The scope of NewsStand could be significantly increased by including articles in languages other than English. The problem here is that these articles would not be properly clustered, as the TF-IDF mechanism is language sensitive. Also, geographic names differ in the various languages. One suggestion is to use English as the canonical language and use Google Translate to convert the foreign language articles to English prior to clustering them. This would be done only internally, while the output of NewsStand would still refer to the original articles.

4. Approximate distance oracles. A distance oracle represents the network distances between pairs of vertices in a road network as a set of (key, value) pairs (i.e., a hash table), such that given any pair of source and destination vertices and an estimation error bound \epsilon, the use of an approximate distance oracle of size O(n/\epsilon^2) enables one to obtain the \epsilon-approximate network distance by making O(log n) accesses to the hash table. The term n is the number of vertices in the road network. There are a couple of research problems to solve here:
a) Distance oracles are expensive to compute since you need the distances between every pair of vertices in the road network. However, there are many obvious ways of cutting corners here to achieve real speedups. Can you figure out a way of scaling these oracles to work on a data set spanning the whole of the USA?
b) Can you then map the distance oracles to a key-value store to enable the use of cloud computing infrastructures such as MapReduce for performing operations on large road network datasets? Also, can you reduce the number of accesses to one or two rather than O(log n)? This is an easy part, since the mechanics of it are quite easy.
The outcome of this project could be a really useful service on the web where thousands of people can pose distance queries and get approximate network distances in fractions of a second. Shangfu Peng has pretty much done this. Need to check with him whether any of these suggestions still remain viable at this point.

5. Incremental nearest neighbor algorithm:
A. For the Frechet distance. See the SIGSPATIAL'12 paper titled "A GPU approach to sub-trajectory clustering using the Frechet distance" by Joachim Gudmundsson and Nacho Valladares.
B. Incremental nearest neighbor algorithm for the Earth mover's distance. See papers by David Jacobs and Leo Guibas.

6. Look into Charikar and Broder's work on LSH (Kanellakis award) and figure out how it is related to the example in Samet's "Foundations of Multidimensional and Metric Data Structures", as well as try to incorporate it into the text. Charikar's paper on simhash: M. Charikar. Similarity estimation techniques from rounding algorithms. Proceedings of STOC, pp. 380-388, 2002. http://www.bradblock.com.s3-website-us-west-1.amazonaws.com/Similarity_Estimation_Te+chniques_from_Rounding_Algorithms.pdf It could be that the example in Section 4.7.4 on pages 711-716 of Samet's Foundations of Multidimensional and Metric Data Structures that uses binary digits is in fact simhash. For a good treatment of simhash, see http://matpalm.com/resemblance/simhash/ Broder's paper on minhash: A. Z. Broder. On the resemblance and containment of documents. Proceedings of the IEEE Conference on Compression and Complexity of Sequences, pp. 21-29, 1997. http://gatekeeper.dec.com/ftp/pub/dec/SRC/publications/broder/positano-final-wpnums.pdf
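As background for project 6, a minimal simhash in the spirit of Charikar's random-hyperplane rounding might look as follows. The 64-bit fingerprint size, the md5-of-term shortcut (standing in for explicit random hyperplanes), and the example strings are simplifications, not the paper's construction.

    # Minimal simhash sketch (cf. Charikar's rounding scheme, project 6).
    # A 64-bit fingerprint is built from per-term hashes; two documents are
    # compared by the Hamming distance of their fingerprints.
    import hashlib
    from collections import Counter

    BITS = 64

    def simhash(text):
        counts = [0] * BITS
        for term, weight in Counter(text.lower().split()).items():
            h = int(hashlib.md5(term.encode()).hexdigest(), 16)
            for i in range(BITS):
                counts[i] += weight if (h >> i) & 1 else -weight
        return sum(1 << i for i in range(BITS) if counts[i] > 0)

    def hamming(a, b):
        return bin(a ^ b).count("1")

    f1 = simhash("distance oracles for spatial networks")
    f2 = simhash("approximate distance oracles for road networks")
    print(hamming(f1, f2), "differing bits out of", BITS)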
7. Provide simple pseudocode for the Yi and Samet ICDE'13 paper in the style of Samet's algorithms and see if it can be extended to metric spaces.

8. Look into the structured graphs work of Leila De Floriani and see if it can be applied to social networks. Need to look up her papers.

9. Spatially-oriented social network identification. In other words, investigate how to incorporate knowledge of spatial proximity (also temporal proximity).

10. Is it possible to use differential privacy in a spatial context? Shangfu Peng is an expert here.

11. Parallel graph traversal in a distributed computing environment. The problem of graph traversal is concerned with searching through graph nodes to find solutions with respect to a certain optimization criterion. For example, the problem of finding the k nearest supermarkets with respect to a location q in a road network can be modeled as a problem of traversing the road network to find a supermarket p (in a set of supermarkets) which minimizes the network distance D(q,p). This project is concerned with applying distributed computing frameworks, e.g., MapReduce and Distributed GraphLab, to formulate different graph traversal algorithms for different spatial-network query problems, e.g., range, nearest neighbor, reverse nearest neighbor, skyline, and distance join queries.

12. Survey of spatio-textual join papers and work.
a. ICDE 2013
b. IEEE TKDE. "A Prefix-Filter based Method for Spatio-Textual Similarity Join" by S. Liu, G. Li, and J. Feng (Tsinghua University, Beijing). To appear. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6517849
Try to embed such a query into NewsStand.

13. Connected component labeling on GPUs and maybe other operations. E.g., http://scholar.google.com/scholar_url?hl=en&q=http://link.springer.com/article/10.1007/s11265-013-0780-0&sa=X&scisig=AAGBfm2l2qu-nTRZNaFxYCtXYksN8KX-dg&oi=scholaralrt (A union-find sketch of the classical two-pass labeling scheme appears after item 16 below.)

14. Can you adapt the following methods to spatial data like quadtrees or some other appropriate variant: Optimizing bitmap indices with efficient compression. K. Wu, E. J. Otoo, and A. Shoshani. ACM Transactions on Database Systems (TODS), 31(1):1-38. Look at other work by K. (John) Wu, especially: On the performance of bitmap indices for high cardinality attributes. K. Wu, E. Otoo, and A. Shoshani. Proceedings of the Thirtieth International Conference on Very Large Data Bases, pages 24-35, 2004. Also see: P. Nagarkar, K. S. Candan, and A. Bhat. Compressed spatial hierarchical bitmap (cSHB) indexes for efficiently processing spatial range query workloads. PVLDB, 8(12):1382-1393, August 2015.

15. Reimplement the SAND Internet Browser using the Vertica community edition. This means substituting the SAND homegrown database with the Vertica database. Is this possible? Need to ask Jagan, and also whether it is worth it in the sense of what we would gain from doing so. This would mean that the various algorithms for computing the incremental nearest neighbor and range queries would need reimplementing. This also means investigating how to use a pointerless quadtree in this environment, or whether one would use a pointer-based quadtree. One option is to store quadtrees in an array representation where pointers are array indexes. The problem is that there is little ability for random access, as one needs to start at the root. Also, it does not support updates.

16. Ask Vertica for some of the queries that their spatial customers would like to execute and see if we can develop algorithms. We were worried about representing polygons with holes. How would this be done with PMR and PM2 quadtrees? We have tried this a number of times but received no response.
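Item 13 above and item 18 below both concern connected component labeling. The classical two-pass scheme with union-find that those papers build on can be sketched as follows, for 4-connectivity on a binary image; this is the textbook algorithm, not any particular paper's optimized variant.

    # Sketch of two-pass connected component labeling with union-find on a
    # binary image (4-connectivity).  This is the textbook scheme that the
    # Wu/Otoo/Suzuki and Fiorio/Gustedt papers optimize, not their code.
    def label(image):
        rows, cols = len(image), len(image[0])
        labels = [[0] * cols for _ in range(rows)]
        parent = [0]                      # parent[i] is the union-find parent of label i
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        def union(a, b):
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)
        next_label = 1
        for r in range(rows):             # first pass: provisional labels + equivalences
            for c in range(cols):
                if not image[r][c]:
                    continue
                up = labels[r - 1][c] if r else 0
                left = labels[r][c - 1] if c else 0
                if up == 0 and left == 0:
                    parent.append(next_label)
                    labels[r][c] = next_label
                    next_label += 1
                else:
                    labels[r][c] = min(x for x in (up, left) if x)
                    if up and left:
                        union(up, left)
        for r in range(rows):             # second pass: replace labels by their roots
            for c in range(cols):
                if labels[r][c]:
                    labels[r][c] = find(labels[r][c])
        return labels

    print(label([[1, 1, 0, 1],
                 [0, 1, 0, 1],
                 [1, 0, 0, 1]]))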
17. Does it make sense to use MongoDB for spatial data such as in the SAND Internet Browser?

18. K. Wu's results on connected component labeling: Kesheng Wu, Ekow Otoo, and Kenji Suzuki. Optimizing two-pass connected-component labeling algorithms. Pattern Analysis and Applications, 12(2):117-135, June 2009. They claim to have the fastest method of doing connected component labeling. They compared it with the following algorithm, characterized as being of the same type as the Dillencourt/Samet/Tamminen algorithm: C. Fiorio and J. Gustedt. Two linear time union-find strategies for image processing. Theoretical Computer Science, 154(2):165-181, 1997. However, it is not clear that Fiorio and Gustedt implemented the Dillencourt/Samet/Tamminen algorithm. So, a comparison with the Wu/Otoo/Suzuki method may be worthwhile.

19. Look at the maps in http://www.washingtonpost.com/blogs/blogpost/post/40-maps-that-explain-the-world/2013/08/12/5775d214-0364-11e3-bfc5-406b928603b2_blog.html and show how the information that they capture could be presented using queries in the SAND Internet Browser. This means expressing the underlying data as spatial relations and then formulating queries using the primitives available in the SAND Internet Browser, possibly using the functionality of SAND-Tcl. See http://www.cs.umd.edu/~hjs/pubs/sandtcl.ref.pdf

20. Fix the mapping apps of Apple Maps, Google Android Maps, and iOS Google Maps to show the entire world at the full zoom-out level. In other words, enhance the API and app to show the whole world as in iOS 5. Otherwise, one cannot get the notion of browsing a newspaper to really work, as it is too slow and one cannot get the big picture.

21. Investigate the notion of implementing a point quadtree variant with a splay strategy. In other words, how do you splay the data points? It seems similar to the notion used by Samet in: H. Samet. Deletion in two-dimensional quad trees. Communications of the ACM, 23(12):703-710, December 1980. to delete, where we make the node closest to the deleted node the root of the new point quadtree. The splay tree of Tarjan and Sleator does this by rotation operations, while we do it with one operation. However, the result is the same in the sense that the replacement node becomes the new root node. The only difference is that we actually rebuild the underlying point quadtree with the new root, while in the one-dimensional splay tree, we move the deleted (or accessed) node to the root via rotation operations which preserve the structure of the binary search tree. The whole idea is analogous to a self-organizing file. This was motivated by listening to Eunhui Park's Ph.D. defense, although she was unaware of this relationship.
QUESTION: Can Samet's point quadtree deletion algorithm be expressed in terms of splay operations?
QUESTION: Samet observed that, empirically, the total path length decreased as a result of the deletion operation. Also, a logarithmic execution time was observed.
a. Can you prove the logarithmic deletion time?
b. What about an amortized logarithmic deletion time?
c. Can you prove that the total path length decreases in 2d? What about in 1d?

22. Eunhui Park thesis defense observations:
a. Look into the scapegoat tree, which is similar to the splay tree but involves no rotations (p. 13).
b. The treap is analogous to the priority search tree except that it uses a binary search tree on the values and a heap on priorities (assigned to the values at random, so that the structure does not inherit any non-randomness in the values, e.g., data that arrives in sorted order). In essence, the structure is nothing more than a Cartesian tree. The concept is analogous to the skip list except that no lists are used.
c. The splay quadtree is a misnomer. It is really a splay BBD-tree, but that does not sound so exciting. There cannot be a structure such as a splay quadtree (really a PR quadtree), because the structure's shape is independent of the order of insertion. The BBD-tree can be splayed because the aggregation process is decoupled from the partition process.
d. It is pointless to talk of having a splay red-black tree, as this is nothing but a 2-3 or 2-3-4 tree (order-3 or order-4 B-tree). The fact that these corresponding B-trees are balanced makes it pointless to ask how to splay them, as all this can achieve is rotation, and it is much more complex to specify in the framework of a red-black tree than in the framework of a B-tree.
e. It would be nice to have an example of a skip quadtree (p. 19) using the point data from Samet's data.
f. Explain Chan's method using Samet's data (p. 22).
g. Check the fair split tree of Kosaraju and Callahan (pp. 23-24) and see if it can be shown using Samet's data.
h. Give an example of spanners (p. 24) using Samet's data.
i. Explain the concept of doubling dimension (pp. 25-26).
j. Ask Jagan about homogeneous well-separation (p. 80).
k. Ask Jagan about the well-separated pair construction algorithms in the thesis.
l. What is the relation of net trees and r-nets to traditional metric space data structures like mb-trees, etc.?

NOT WELL-DEFINED PROJECTS:

1. Volunteer geographic information systems
a. Workshop at SIGSPATIAL'12
b. http://scholar.google.com/scholar_url?hl=en&q=http://onlinelibrary.wiley.com/doi/10.1111/tgis.12039/full&sa=X&scisig=AAGBfm05uni8BuefGPiF3h4MpWZAMeZaOQ&oi=scholaralrt

2. Sentence similarity with spatial data. E.g., http://scholar.google.com/scholar_url?hl=en&q=http://www.aclweb.org/anthology/N/N13/N13-1089.pdf&sa=X&scisig=AAGBfm0jdx1nhl4y7qi1i3SMaZ0Wf5xLRQ&oi=scholaralrt

3. C. Garth and K. Joy. Fast, Memory-Efficient Cell Location in Unstructured Grids for Visualization. IEEE TVCG, 16(6):1541-1550, 2010. Mentioned in http://www.iwr.uni-heidelberg.de/groups/CoVis/Teaching/VNT_SS13/vis2-4_effDatenstrukturen.pdf

4. A. Telea. Data Visualization. AK Peters Ltd., 2007. Mentioned in http://www.iwr.uni-heidelberg.de/groups/CoVis/Teaching/VNT_SS13/vis2-4_effDatenstrukturen.pdf

5. See the VTK Visualization Toolkit (Kitware).

6. Text summarization via Hidden Markov Models (Dianne O'Leary, SIGIR 2001). http://dl.acm.org/citation.cfm?id=384042 Mike Szaszy is a candidate for working on this topic.

7. Measure clustering quality. Mike Szaszy project.

8. Look into the following paper on toponym resolution and see what they do. Text-Driven Toponym Resolution using Indirect Supervision. M. Speriosu and J. Baldridge. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 1466-1476, Sofia, Bulgaria, August 4-9, 2013. http://scholar.google.com/scholar_url?hl=en&q=http://acl.eldoc.ub.rug.nl/mirror/P/P13/P13-1144.pdf&sa=X&scisig=AAGBfm1kAADgLLcMnUTnjIfQEU_hnq0aEw&oi=scholaralrt This paper encourages using the presence of clues like "lobster" to help resolve "Portland" in favor of "Maine" over "Oregon". This comes out of the linguistics literature. Examine it. This is also related to the use of concepts by Gianluca.
9. Fiona from Australia sent the following reference: Kjetil Vaage Øie (2012). Sensing the News: User Experiences when Reading Locative News. Future Internet 2012, 4, 161-178; doi:10.3390/fi4010161

10. Look into the European LEILAS "EuroStars" project at http://www.leilas.eu/ This website has information about work plans for the project. Try to find out what they actually accomplished. This means finding technical reports, etc. At the moment, all that the site contains are descriptions of tasks for a proposal. It was started in 2011, so there should be some results. One four-page paper in French, which is lacking in detail, can be found at http://scholar.google.com/scholar_url?hl=en&q=http://www.viseo.net/sites/default/files/viseo-group/article-leilas_representation_geospatiale_d_ensemble_de_documents_textuels-ecg2013.pdf&sa=X&scisig=AAGBfm3KqeEfEYf5cqS94Xnq53QuYLOHYw&oi=scholaralrt

11. The GeoXLS (ACMGIS'11) system is a search engine that allows users to submit a set of locations as a query object Q and to find documents containing locations similar to those in Q. Search results come from a collection of geotagged web documents, specifically a vast collection of spreadsheets obtained from the Web. The results are ranked according to their similarity to Q, where similarity is measured using the Hausdorff distance or the modified Hausdorff distance (a variant that reduces the effect of outliers); see the sketch following item 15 below. For example, given a point set query consisting of locations where a disease outbreak has been reported, one could search a database of previous outbreaks to find others with a similar geographic distribution. The search algorithm iteratively computes estimates of the distances between the query point set and the collection being searched, to prune the search space without computing the Hausdorff distance from the query set to every other point set in the collection.
a. The system currently searches point sets from a collection of spreadsheets downloaded from the Web. Identify other possible collections of point sets to search, and extend GeoXLS to search these point sets. One possible source of point sets is the Wikipedia.
b. Construct a GPU implementation of the incremental Hausdorff search algorithm utilized by GeoXLS.
c. The Hausdorff distance and modified Hausdorff distance are both computed by identifying the nearest neighbor of each point, but a single point can be used as the nearest neighbor of many points in the other point set (for example, a point set with 100 points in College Park and 1 point in London will be "similar" to a point set with 1 point in College Park and 100 points in London by these measures). Identify a distribution-sensitive similarity measure between a pair of point sets that limits this effect, and extend GeoXLS to support it.

12. Apply methods such as Hausdorff distance computation to the problem of hurricane detection and the tracking of similar ones. This is quite vague. It requires a person who is a domain expert in this area.

13. Compare NewsStand with Pulse, Flipboard, and other online news aggregation sites. What enhancements could NewsStand use?

14. Spatial keyword queries. See the SIGMOD'13 paper and references cited therein. Given a location and a set of keywords:
a. Find all objects, or a minimum set of objects, that cover a set of keywords.
b. Could minimize the area of the covering region.

15. Spatial join papers of the Ailamaki group at EPFL - SIGMOD'13, EDBT, SSDBM'13, etc.
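The (modified) Hausdorff distances used by GeoXLS in item 11 can be sketched directly from their definitions. The brute-force nearest-neighbor loops below are only for illustration, since the point of the incremental search algorithm is to avoid exactly that; the example coordinates are arbitrary.

    # Brute-force sketch of the Hausdorff and modified Hausdorff distances
    # between two point sets (cf. the GeoXLS project, item 11).  Real systems
    # replace the inner nearest-neighbor loops with an index.
    import math

    def _nn_dist(p, points):
        return min(math.dist(p, q) for q in points)

    def hausdorff(A, B):
        # max over both directions of the farthest nearest-neighbor distance
        return max(max(_nn_dist(a, B) for a in A),
                   max(_nn_dist(b, A) for b in B))

    def modified_hausdorff(A, B):
        # averaging reduces the influence of outlier points
        return max(sum(_nn_dist(a, B) for a in A) / len(A),
                   sum(_nn_dist(b, A) for b in B) / len(B))

    A = [(38.99, -76.94), (38.90, -77.04)]       # e.g., College Park, Washington
    B = [(38.98, -76.93), (51.51, -0.13)]        # a nearby point and London
    print(hausdorff(A, B), modified_hausdorff(A, B))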
16. SIGMOD'13 paper in Session 21 on tweets and finding all garage sales within 24 miles of a specific location. The idea is to constantly update the query results as events come up, rather than keeping them static. Currently, updates only happen in TwitterStand/NewsStand when the map moves a bit.

17. SIGMOD'13 paper on spatial keyword searches where one wants objects with certain properties from a minimal spatial area, etc. Not very novel, but one could do some extensions.

18. Look into the road network paper at SIGMOD'13 that improves on the Samet SIGMOD'08 paper.

19. Raymond Wong's paper at SIGMOD'13 about the bottleneck problem; compare with the Efrat and Itai paper.

---------------- NSF Key Value Store Proposal Tasks ----------------

RESEARCH TIMELINE:

Year 1:
A. Point Cloud
1. Derive a data partitioning scheme for a large point cloud.
2. Formulate a set of parallel algorithms for querying large point clouds.
3. Evaluate the efficiency of the proposed parallel algorithms by analyzing factors like the degree of parallelism, communication cost, and granularity.
B. Loose Quadtree
1. Implement a distributed variant of the loose quadtree for indexing moving objects.
2. Derive parallel query algorithms for the distributed loose quadtree.
3. Evaluate the effectiveness of the distributed loose quadtree by comparing it to grid-based moving object indexes.
C. Multiresolution Range Query
1. Formulate a distance-preserving hashing technique which enables range queries to be evaluated via simple lookup operations (see the sketch at the end of this section).
2. Compare and contrast the performance of the proposed technique to a classical solution based on hierarchical indexes.
3. Evaluate the effectiveness of the proposed hashing technique by applying it to real-world online map applications.

Year 2: Distance Oracle
1. Investigate the possibility of mapping distance oracles to a key-value store.
2. Devise a method to provide access to the key-value store representation of the distance oracle using one or two accesses, most of the time.
3. Demonstrate the applicability of our methods to Internet-scale applications by studying the throughput, average access times, and the quality of the answers.
4. Deploy the distance oracle in complicated query processing scenarios, especially those in location-based services and local searches.
5. Devise a set of publicly available distance oracles for the entire United States for different values of $\epsilon$.

Year 3: Distributed Graph Traversal
1. Derive a parallel graph traversal method for a message-passing-based distributed computing framework like Pregel.
2. Derive a parallel graph traversal method which does not require communication between tasks for the MapReduce framework.
3. Formulate parallel algorithms for different query problems (e.g., range, nearest neighbor, reverse nearest neighbor, and distance join queries) using the derived methods.
4. Compare and contrast the performance of the message-passing-based algorithms with algorithms that do not require communication between parallel tasks in different processing environments.

The main impact of this project is expected to be a spatial database management system which supports a wide range of spatial query types and is capable of "scaling out" to handle a large amount of spatial data. This will enable spatial applications such as online mapping, computer aided design, online gaming, and scientific simulations to store terabytes of spatial data and to execute queries such as shortest path and self-join on them. Thus the improvement will lie in the scale and range of the sizes of the data sets that we will be able to handle.
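The "multiresolution range query" task above (range queries answered by simple lookups against a key-value store) is often explained with space-filling-curve keys. The sketch below uses Morton (Z-order) codes as keys over a sorted list standing in for an ordered key-value store; the 16-bit grid and the single coarse Morton interval per query are simplifications, not the proposal's actual design, and a production version would split the query window into finer sub-intervals.

    # Sketch of answering a spatial range query with ordered-key scans only,
    # using Morton (Z-order) codes as keys.  The Morton interval covers the
    # query window but also extra cells, so false positives are filtered out.
    from bisect import bisect_left, bisect_right

    def interleave(x, y, bits=16):
        z = 0
        for i in range(bits):                      # interleave the bits of x and y
            z |= ((x >> i) & 1) << (2 * i)
            z |= ((y >> i) & 1) << (2 * i + 1)
        return z

    class MortonStore:
        def __init__(self, points):
            # "table" of (key, value) pairs sorted by Morton key
            self.rows = sorted((interleave(x, y), (x, y)) for x, y in points)
            self.keys = [k for k, _ in self.rows]

        def range_query(self, x1, y1, x2, y2):
            lo, hi = interleave(x1, y1), interleave(x2, y2)
            i = bisect_left(self.keys, lo)
            j = bisect_right(self.keys, hi)
            return [p for _, p in self.rows[i:j]
                    if x1 <= p[0] <= x2 and y1 <= p[1] <= y2]

    store = MortonStore([(3, 4), (10, 10), (200, 7), (12, 9)])
    print(store.range_query(0, 0, 15, 15))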
----------------- NSF GeoMultiMedia Proposal Tasks -----------------

RESEARCH TIMELINE:

Year 1: Geotagging and Clustering Improvements for News and Tweets
a. Identify features to be used in a machine learning based approach
b. Evaluate results of using machine learning vis-a-vis the current rule-based approach
c. Compare results with existing systems such as OpenCalais and Placemaker
d. Develop methods to identify reliable seeders for TwitterStand
e. Develop GPU-based document clustering based on an analogy to sparse matrix multiplication
The evaluation of the effectiveness of these components will be ongoing. In the first year, we will construct ground truths which consist of news articles that we deem to contain representative articles with the type of geographic locations that we wish to recognize and resolve correctly. In addition, we will make use of the error feedback mechanism to collect geographic locations that the system has consistently failed to recognize or has resolved incorrectly. These articles will form a set which will be tested from time to time to see if subsequent system improvements result in the errors disappearing.

Year 2: Image Similarity Using the Associated Text
a. Devise a system to retrieve newsworthy images and videos using a map query interface
b. Develop techniques for context-based image similarity search
c. Identify image transformation types that result in partial-duplicate images
d. Apply partial-duplicate image detection as a filtering mechanism for image queries
The evaluation will be conducted using a ground truth, which is a set of images that we have been collecting during the first and second year that are representative of the type of similarity which we would like the system to be able to detect. The evaluation will be ongoing with feedback for enhancement.

Year 3: Use of a Language Translation Service for News Text
a. Evaluate translation services (e.g., Google Translate) in terms of preservation of the cluster property
b. Evaluate the effectiveness of using translations of cluster centroids as cluster representatives for foreign language documents
c. Evaluate the use of non-native language documents to find similar clusters in the native language
d. Use Arabic language news web sites to evaluate translation quality
e. Identify reliable seeders for TwitterStand in other languages such as Arabic
The evaluation in this context is more of a qualitative nature in the sense that what we are doing is trying to incorporate the results of the translation service and determining if it makes sense to do so and, if so, in which components of our system. The focus on Arabic in the case of seeders for TwitterStand is primarily because we have access to a collaborator who has done much work in processing news sites in the Arabic language. We hope to draw on his insight in processing Arabic tweets. The identification of seeders will be ongoing, with the final evaluation done in the third year. Identification will use factors such as being the first to tweet on an important news topic, the number of tweets by others on the same topic, the number of different topics tweeted on, and the extent to which the seeder re-tweets (a negative factor). Of course, we also need to define the newsworthiness of the topic. This is done by making use of the URL that the tweet usually points at and whether there are existing news articles in its cluster.
The evaluation is done over an extended time period.

POSTDOC
The postdoc will be working on the geotagging module, which is a key to the success of the system. The postdoc will focus on the machine learning component in the first two years while also investigating the feasibility of the development of a language-independent geotagger. This will most likely be based on machine learning, as the rule-based techniques have a large language-dependent component. The postdoc will also work on investigating the extent to which the translation preserves clustering.

-------------------------- 2015 New Ideas --------------------------

1. Read the following IEEE PAMI June 2015 article on image similarity: IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 37, No. 6, June 2015. What Can Pictures Tell Us About Web Pages? Improving Document Search Using Images. By Sergio Rodriguez-Vaamonde, Lorenzo Torresani, and Andrew W. Fitzgibbon. Read the full article at http://www.computer.org/web/tpami?lf1=5391298952c991816059537a42786516
Abstract: Traditional Web search engines do not use the images in the HTML pages to find relevant documents for a given query. Instead, they typically operate by computing a measure of agreement between the keywords provided by the user and only the text portion of each page. In this paper we study whether the content of the pictures appearing in a Web page can be used to enrich the semantic description of an HTML document and consequently boost the performance of a keyword-based search engine. We present a Web-scalable system that exploits a pure text-based search engine to find an initial set of candidate documents for a given query. Then, the candidate set is reranked using visual information extracted from the images contained in the pages. The resulting system retains the computational efficiency of traditional text-based search engines with only a small additional storage cost needed to encode the visual information. We test our approach on one of the TREC Million Query Track benchmarks where we show that the exploitation of visual content yields improvement in accuracies for two distinct text-based search engines, including the system with the best reported performance on this benchmark. We further validate our approach by collecting document relevance judgments on our search results using Amazon Mechanical Turk. The results of this experiment confirm the improvement in accuracy produced by our image-based reranker over a pure text-based system.
This work is similar in spirit to the paper: J. Sankaranarayanan and H. Samet. Images in news. In Proceedings of the 20th International Conference on Pattern Recognition, pages 3240-3243, Istanbul, Turkey, August 2010. Categories: [spatio-textual search engine, Twitter] See http://www.cs.umd.edu/~hjs/pubs/news-icpr.pdf The goal here is to compare with NewsStand and to get a paper published at least in the ICPR'2016 conference or even CVPR, and a journal paper in the IEEE PAMI journal.

2. Using NewsStand and TwitterStand for disaster/emergency management. Look at what Jagan and Hong put together for the Nepal earthquake.

3. Modeling newspapers by their local lexicons.

4. User studies with researchers at the University of Salerno on the one-handed interface.

5. Google watch NewsStand implementation on Motorola (round face).

6. Google watch NewsStand implementation on Samsung (square face).
7. Apply the Samet/Lindenbaum analysis in the SIAM Journal to three-dimensional data to see if the PMR quadtree as well as the PM1 quadtree and MX octree have similar complexity bounds as the two-dimensional variants.

8. Analyze Samet's point quadtree deletion algorithm, which makes use of a splay-like operation, to see if one can prove that the total path length is decreased as a result of the operation. Also see if the complexity of this implementation of deletion is indeed O(log N), as observed by Samet in his 1980 CACM article.

9. Shangfu's image gallery for Instagram. Any other applications?

10. Read the article "Making the Mobile Web Faster" by K. Matsudaira, Communications of the ACM, Vol. 56, No. 3, March 2013, pages 56-61. http://dl.acm.org/citation.cfm?id=2428572 It presents various techniques to make web-based mobile applications faster. It would be interesting to try these techniques on the two versions of NewsStand:
1. App version
2. Mobile version
This could be attempted for all three of the iPhone, Android, and Windows Phone. The author suggests a number of methods. This could be done as a group project of several teams, with the winners being given the higher grades. One of the key research issues here is how to make map-based applications on the web faster, since much data is repeatedly transferred to the device. So, some issues are:
1. Minimizing connections and data across the network
2. Minimizing image requests (i.e., map tiles)
3. Leveraging local storage and caching
4. Pre-fetching and caching data (how different from item 3?)
5. Use non-blocking I/O (explain relevance here)
6. Avoiding redirects and minimizing DNS lookups
7. Use HTTP pipelining and SPDY (not sure about this!)
8. Sending the right data:
a. Use limit and offset to get results
b. Support partial response and partial update
c. Avoid or minimize cookies
d. Establish device profiles for APIs (e.g., different results for different form factors)

11. Read the article: A Scalable Content-Addressable Network, by S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker, SIGCOMM 2001. It describes the CAN method, which makes use of a quadtree representation. Look at the paper more closely and determine what quadtree research results (probably nearest neighbor computation and linear quadtrees) could be used here to improve the algorithms, etc. This could be a dead-end project in the sense that there is nothing substantive in it. Whoever takes this project needs to have an exit plan in terms of an alternative project should this one not pan out.

12. Read the article: Ptolemaic indexing. M. L. Hetland. Journal of Computational Geometry, 2015.
Abstract: This paper discusses a new family of bounds for use in similarity search, related to those used in metric indexing, but based on Ptolemy's inequality, rather than the metric axioms. Ptolemy's inequality holds for the well-known Euclidean distance, but is also ...
http://scholar.google.com/scholar_url?url=http://www.jocg.org/index.php/jocg/article/download/42/85&hl=en&sa=X&scisig=AAGBfm0vUOB6jdwnrhEsEgRKEHICSdhdUQ&nossl=1&oi=scholaralrt
Definition of the Ptolemaic property: even though most distance-based indexing has centered on the metric axioms, this paper focuses on another property, known as Ptolemy's inequality, which states that xv \cdot yu \leq xy \cdot uv + xu \cdot yv (1) for all objects x, y, u, v. Premetrics satisfying this inequality are called Ptolemaic [2].
In terms of the Euclidean plane: for any quadrilateral, the sum of the pairwise products of opposing sides is greater than or equal to the product of the diagonals (see Fig. 1). It is a well-known fact that the Euclidean distance is Ptolemaic [15]. There are many proofs for this, some quite involved, but it can be shown quite simply using the idea of inversion [16]. Definition of a premetric: a dissimilarity function. The project is to see if the incremental nearest neighbor method of Hjaltason and Samet works in this context. The paper mentions these algorithms, but it is not clear if they work.

13. Read the article: Bodo Manthey. Smoothed Analysis of Local Search Algorithms. Keynote talk at WADS'15, Victoria, British Columbia, Canada, August 2015. http://wwwhome.math.utwente.nl/~mantheyb/misc/WADS2015_Manthey_LocalSearch.pdf Look into it and find a connection to the iNN algorithm of Hjaltason and Samet, if any.

14. Read the article: JiTTree: A Just-in-Time Compiled Sparse GPU Volume Data Structure. M. Labschutz, S. Bruckner, M. E. Groller, and M. Hadwiger. IEEE TVCG, August 15, 2015, online posting.
Abstract: Sparse volume data structures enable the efficient representation of large but sparse volumes in GPU memory for computation and visualization. However, the choice of a specific data structure for a given data set depends on several factors, such as the ...
May be of interest for volume data structures for GPU usage. Need to look at it more carefully. Compare with the PK-tree or adapt the PK-tree to this environment. A possible CMSC 725 project.

MOST RECENTLY ADDED TOPICS on September 1, 2015:

1. Trendspottr analog - one for news and one for tweets.

2. Expand on the work of Rongjian Lan for the spatiotemporal NewsStand. The first item of business is to fix the Tour. Next, try to include news articles over a longer period of time and use his visualizer in the form of a heat map. This would show the migration of articles about a specific cluster over time and space. Rongjian Lan's paper at the 2014 SIGSPATIAL Workshop on HEALTHGIS is a starting point.

3. Research on caption generation (see the 2015 MSR Faculty Summit).

4. Compare the news aggregators mentioned in the September 2015 issue of CACM with NewsStand and enhance NewsStand to contain some of their features.

5. Implement NewsStand (adapt it) to run on the Google Watch. There are two variants, the Samsung and the Motorola. We could have a competition where we have teams of two students.

6. Enhance WeiboStand to have the same UI as TwitterStand and the same clustering. Also make use of Microsoft Translator. It appeared in the 2014 SIGSPATIAL Workshop on Location-Based Social Networks and was a project in Fall 2013 by Cheng Fu.

7. Implement TRUE semantic caching in NewsStand and differentiate it from Chung. Need REAL performance data. This paper appeared in the 2013 SIGSPATIAL Conference.

8. Determine the sun exposure for a particular house via data from sources such as Google Street View, Zillow, and others. Also for a simple tent with four bags to hold it, where you want to see the shaded area (i.e., the position of the sun) as time goes on.

9. Modify WAZE and navigation programs to understand elevations where the targets have different heights, like differentiating between the arrival and departure levels at Ben Gurion airport in Israel for the Eldan car return, as well as levels in a parking structure, etc. In other words, we should modify the representation of the road network to include such features. For example, look at SILC and the distance oracle.

10. "Last mile" navigation systems.
Modify one of them, such as WAZE, to take into account parking in determining a path to your destination. In other words, the destination includes more than one point. This route could be optimized by a minimum price for the time period and the distance from the true destination. Thus we are adding temporal factors to the shortest path computation:
a. Minimize parking cost and distance from the true destination in terms of walking, taking a shuttle bus, etc.
b. Give the duration of the stay as a parameter in order to determine the optimal path.
This could be used for determining a route to the airport, which can be further differentiated by whether you will use valet parking, a shuttle bus, etc. What we see here is a much more detailed shortest path and navigation plan. Currently, programs like Google Directions take into account factors such as navigation mode (e.g., walking, driving, and public transit) but not the "last mile", which is where people really get lost. Our oracle model shows this to be true, where we see that the shortest paths between many sources and destinations are very similar.

11. Look at Bruno Martins' references for geotagging and apply them to NewsStand and evaluate.

12. Enhance the cleanup paper, which is about a system that cleans up RSS feed web pages and extracts images, videos, etc., especially in terms of excluding ads. A paper was written about the NewsStand cleanup system by Ben Teitler, but reviewers asked for more work. Shangfu Peng knows about its current state. The same could also be applied to cleaning up videos. Larry Davis's former Turkish student Ismail Haritaoglu has a company that is interested in such a technology, as they serve videos to major sports networks.

13. Transcription of videos in NewsStand so that we can cluster their contents better.

14. ORCAM glasses for Alzheimer's patients, so that you only keep track of people in your photo album and do not record what you see. Instead, if you see someone, you check your own personal photo library to see if they have been seen before and who they are. This saves the embarrassment of not recognizing your spouse.

15. Windowseat

16. WAZE social network for finding gas stations and toilets.

PROJECTS SUGGESTED BY SHANGFU PENG ON September 1, 2015:

We have built the following systems:
1. STEWARD: http://steward.umiacs.umd.edu/
2. NewsStand: http://newsstand.umiacs.umd.edu/
3. TwitterStand (new): http://twitterstand.umiacs.umd.edu/News/
4. PhotoStand: http://newsstand.umiacs.umd.edu/photostand-jack/
5. Geollery (Instagram Images): http://sametphp.umiacs.umd.edu/geollery/
6. ASDO: http://sametnginx.umiacs.umd.edu/

Possible Projects:

1. Generalized Quadtree for a Database System. This is similar to GiST in PostgreSQL. You need to develop the general methods and implement them for a generalized quadtree data structure. Then users can use your data structure to easily specify numerous quadtree variants such as the MX quadtree, PM quadtree, PM2, PM3, edge quadtree, PMR quadtree, loose quadtree, etc. Related work is the SP-GiST work of Aref et al. If you do this project, yours needs to be much better!
W.~G. Aref and I.~F. Ilyas. An extensible index for spatial databases. In {\em Proceedings of the 13th International Conference on Scientific and Statistical Database Management}, pages 49--58, Fairfax, VA, July 2001.
W.~G. Aref and I.~F. Ilyas. {SP-GiST}: an extensible database index for supporting space partitioning trees. {\em Journal of Intelligent Information Systems}, 17(2--3):215--240, December 2001.
Also see the paper titled "Generalized Search Trees for Database Systems": J.~M. Hellerstein, J.~F. Naughton, and A.~Pfeffer. Generalized search trees for database systems. In {\em Proceedings of the 21st International Conference on Very Large Data Bases ({VLDB})}, U.~Dayal, P.~M.~D. Gray, and S.~Nishio, eds., pages 562--573, Zurich, Switzerland, September 1995. \url{http://gist.cs.berkeley.edu/}.
Some of the key GiST methods are: Consistent, Union, Compress, Decompress, Penalty, PickSplit, Search, Insert, Split, Delete, etc. GiST can implement: B+-tree, R-tree, hB-tree, RD-tree.
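Project 1 can be pictured as a GiST-style extension interface specialized to quadtrees: the framework owns the recursive 4-way decomposition, and each quadtree variant supplies a small set of methods (reduced to two here for brevity). The method names, the PR-quadtree example, and the depth cap are illustrative assumptions, not SP-GiST's actual API.

    # Sketch of a GiST-like extension interface for quadtree variants
    # (project 1).  The framework owns the decomposition; a variant only
    # supplies the split test, plus a Consistent-style overlap test that
    # drives search.
    class QuadtreeVariant:
        def must_split(self, points, depth):   # roughly GiST's PickSplit/Penalty role
            raise NotImplementedError
        def consistent(self, box, query_box):  # roughly GiST's Consistent role
            bx1, by1, bx2, by2 = box
            qx1, qy1, qx2, qy2 = query_box
            return bx1 <= qx2 and qx1 <= bx2 and by1 <= qy2 and qy1 <= by2

    class PRQuadtree(QuadtreeVariant):
        def __init__(self, bucket_size=1):
            self.bucket_size = bucket_size
        def must_split(self, points, depth):
            return len(points) > self.bucket_size and depth < 16

    def build(variant, points, box, depth=0):
        """Generic builder: recurses while the variant says a node must split."""
        if not variant.must_split(points, depth):
            return {"box": box, "points": points}
        x1, y1, x2, y2 = box
        xm, ym = (x1 + x2) / 2, (y1 + y2) / 2
        quads = [(x1, y1, xm, ym), (xm, y1, x2, ym), (x1, ym, xm, y2), (xm, ym, x2, y2)]
        return {"box": box,
                "children": [build(variant,
                                   [p for p in points
                                    if q[0] <= p[0] < q[2] and q[1] <= p[1] < q[3]],
                                   q, depth + 1)
                             for q in quads]}

    def range_search(variant, node, query_box):
        if "points" in node:
            return [p for p in node["points"]
                    if query_box[0] <= p[0] <= query_box[2]
                    and query_box[1] <= p[1] <= query_box[3]]
        return [p for child in node["children"]
                if variant.consistent(child["box"], query_box)
                for p in range_search(variant, child, query_box)]

    tree = build(PRQuadtree(), [(1, 1), (2, 9), (8, 8)], (0, 0, 16, 16))
    print(range_search(PRQuadtree(), tree, (0, 0, 4, 4)))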
2. Dynamic Gazetteer Built Up from Social Media and News. Motivation: this is the first step of "spatial intelligence". GeoNames.org is currently the largest geographical database that spans all countries and contains over eight million place names. However, it is a static gazetteer. Upon encountering a toponym, both GeoNames and NewsStand generate all the possible interpretations for it and order them by population. Instead, we want to vary the order of the interpretations. One possibility is by using the number of their recent occurrences. Another could be the identity of the user (person or news source) who is querying the gazetteer. Still another could involve the time period in which the query is being posed. Another ordering could be based on the average understanding of people for the toponym. If we think of the gazetteer as a well-read person, then the probabilities are equivalent to his spatial knowledge background. All previous gazetteers assume that the person's knowledge background is static. However, in fact, the spatial knowledge background is really dynamic, as it varies over time. Moreover, the spatial knowledge background should depend on the person with whom the well-read person is conversing. Here "dynamic" has several meanings:
A. At a certain time or period, some words may imply a certain location. For example, during the period that Brazil is hosting the FIFA World Cup, the word "soccer" may imply Brazil. Before the next soccer game for country A, the occurrence of the word pair "A soccer" may imply the city in Brazil where the next match for country A is being held. Similarly, the names of some famous people may also refer to a city or a location when those people visit that city. In addition, if a big event happened in a small town, e.g., Paris, Texas, then when people say "Paris", it refers to Paris, Texas with a higher probability than to Paris, France. Thus "dynamic" means that the interpretations of each toponym may vary over time, so that some new interpretations come into play while others vanish. In addition, the probabilities of the interpretations may change.
B. The probabilities may also vary according to the identity of the users. For example, when a person who lives in or travels to Paris, France says "Paris", it always refers to Paris, France rather than Paris, Texas, and vice versa.
NewsStand, TwitterStand, and Instagram can be used to develop the dynamic gazetteer. In these systems, although there are many articles for which no location information is known, we have GPS information associated with some of the articles and/or tweets, so that we know with 100% certainty the related location interpretation. The news articles, tweets, and photos for which we have the 100% certainty information will be used to build the dynamic gazetteer on top of the static gazetteer. A sketch of such a re-ranking appears below.
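One way to make the "dynamic" ordering in project 2 concrete is a gazetteer whose ranking blends static population with an exponentially decayed count of recent confirmed mentions (e.g., from GPS-tagged posts). The half-life, the weight on recent mentions, and the population figures below are arbitrary illustrative choices, not part of the project description.

    # Sketch of a dynamic gazetteer (project 2): interpretations of a toponym
    # are ordered by static population plus an exponentially decayed count of
    # recent confirmed (e.g., GPS-tagged) mentions.  The half-life and the
    # weight on recent mentions are arbitrary illustrative choices.
    import time
    from collections import defaultdict

    HALF_LIFE = 7 * 24 * 3600          # one week, in seconds

    class DynamicGazetteer:
        def __init__(self, static_entries):
            # static_entries: {toponym: [(interpretation, population), ...]}
            self.static = static_entries
            self.mentions = defaultdict(list)   # (toponym, interpretation) -> [timestamps]

        def record_mention(self, toponym, interpretation, ts=None):
            self.mentions[(toponym, interpretation)].append(ts or time.time())

        def _recency(self, toponym, interpretation, now):
            return sum(0.5 ** ((now - ts) / HALF_LIFE)
                       for ts in self.mentions[(toponym, interpretation)])

        def interpretations(self, toponym, now=None):
            now = now or time.time()
            ranked = [(pop + 1_000_000 * self._recency(toponym, name, now), name)
                      for name, pop in self.static.get(toponym, [])]
            return [name for _, name in sorted(ranked, reverse=True)]

    gaz = DynamicGazetteer({"paris": [("Paris, France", 2_200_000),
                                      ("Paris, Texas", 25_000)]})
    for _ in range(5):                  # a burst of confirmed mentions of Paris, Texas
        gaz.record_mention("paris", "Paris, Texas")
    print(gaz.interpretations("paris"))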
3. a. Discovering regions with different functions using social media postings.
b. Discovering users that are spatially similar using social media postings.
In the NLP community, people focus on (word, document) pairs to discover topic models over a document set. In the GIS community, we usually partition the spatial domain, which is indexed using, for example, a quadtree, an R-tree, or a grid-based partition. On the other hand, the daily activities of people as chronicled by their social media posts are usually associated with geotags. Thus, by analogy, for (a) we have pairs such as (social media activity, spatial region) and for (b) we have pairs such as (spatial region, user). We can either treat every spatial region as a "document", or treat every user as a "document". In this case, we can borrow some NLP methods, especially topic models, to get some interesting results. The social media data can come from Twitter (TwitterStand), Instagram (Geollery), or any other sources. For (a), you need to find the regions' functions in terms of their associated social media postings. For (b), you need to find similar users based on similar spatial regions, which are based on their social media postings. A related paper from Microsoft Asia: Discovering Regions of Different Functions in a City Using Human Mobility and POIs. This paper explores the pair (human mobility and POIs, spatial region) to discover regions with different functions, e.g., residence, governmental and public organizations, business, shopping mall, etc. This could be helped via the aid of crowdsourcing methods.

4. Embedded web map API. Provide an embedded web map API that maps text content provided by users onto a map, so that users can easily create a map web page showing their own content. It is like http://www3.clustrmaps.com/ or https://www.revolvermaps.com/, which record the visitors to a website. In this project, you need to provide the embedding code to users, who can then use your code to show their own content on a map. For example, WikiCFP (a wiki for calls for papers, http://www.wikicfp.com/cfp/) may want to show the locations of conferences with recent calls for papers on a map (like Google Maps), but it is a bit of a waste of time for them to build such a page from scratch. In this project, you are building a framework that helps Internet-scale numbers of users automatically generate an attractive web page showing the content that they specify. Note that your framework should understand spatial words, i.e., detect place names and map them to latitude/longitude locations.

NOT USED:

1. Investigate the influence of using the road network distance metric versus the Euclidean distance metric on spatial ranking results. In particular, for what kinds of spatial queries is the road network distance metric appropriate, and for what kinds is the Euclidean distance metric appropriate? You should provide enough theoretical and experimental evidence. Background: there are a lot of location-based services today. Many of them just use the Euclidean distance metric as the ranking function for spatial queries, e.g., finding the several nearest restaurants in Yelp. However, in this case, users really prefer to know the driving distance (road network distance). If you could compute the road network distance as quickly as the Euclidean distance, what would change?
Our ASDO demo can quickly return the shortest network distance between two points in the USA region, e.g., 60,000 queries per second, and we also provide an ASDO web API to do so. For example, the web API can be invoked with curl:
curl http://sametnginx.umiacs.umd.edu/usaapi?lat1=43.45&lon1=-101.99&lat2=40.15&lon2=-94.57
You can optionally use this API to get the road network distance.
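A client-side call to the ASDO web API could look like the following sketch. The endpoint and the lat/lon parameter names come from the curl example above; the shape of the returned value is unknown here, so the response is returned raw rather than parsed.

    # Sketch of calling the ASDO distance API from Python.  The URL and the
    # lat/lon parameter names come from the curl example above; how the
    # response encodes the distance is an assumption, so it is returned raw.
    import requests

    def asdo_distance(lat1, lon1, lat2, lon2):
        resp = requests.get("http://sametnginx.umiacs.umd.edu/usaapi",
                            params={"lat1": lat1, "lon1": lon1,
                                    "lat2": lat2, "lon2": lon2},
                            timeout=10)
        resp.raise_for_status()
        return resp.text          # parse as needed once the response format is known

    print(asdo_distance(43.45, -101.99, 40.15, -94.57))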