
Sample Detailed Answer

(extracted from the VAST 2007 contest entry submitted by Georgia Tech)

 

 

Our system Jigsaw does not have capabilities for finding themes or concepts in a document collection.  Instead, it acts more as a visual index, helping to show which documents are connected to each other and which are relevant to a line of investigation being pursued.  Consequently, we began working on the problem by dividing the news report collection into four pieces (for the four people on our team doing the investigation).  Each of us skimmed the 350+ reports in our own unique subset just to become familiar with general themes discussed in those documents.  We also jotted down notes about potential people, organizations or events to study further. 

 

Next, we came together and used Jigsaw to examine the entire news report collection.  Jigsaw expects an XML file as input that identifies the unique documents and the entities in those documents.  We wrote a translator that converts the text reports and the pre-identified entities from the contest data set into the XML form that Jigsaw can read.  We then ran Jigsaw and explored a number of the potential leads that we had each identified in our initial skim of the reports.  What we looked for at first were connections across entities, essentially the same people, organizations or incidents being discussed in multiple reports.  Jigsaw provides multiple views of the documents and entities, so it is extremely advantageous to have a lot of screen real estate.  In Figure 1 below, we show the workstation where we conducted our investigations.  It has four monitors.
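The translation step described above can be illustrated with a minimal sketch.  It is not the translator we wrote, and the element and attribute names (documents, document, text, entity, type) are hypothetical placeholders rather than Jigsaw's actual XML schema.  The input structure (a list of dictionaries with "id", "text" and "entities" keys) is likewise an assumption made for the example.

```python
import xml.etree.ElementTree as ET


def reports_to_xml(reports):
    """Serialize reports and their pre-identified entities to an XML string.

    `reports` is assumed to be a list of dicts with the keys "id", "text"
    and "entities" (a mapping from entity type to a list of entity names).
    All element names below are hypothetical, not Jigsaw's real schema.
    """
    root = ET.Element("documents")
    for report in reports:
        doc = ET.SubElement(root, "document", id=report["id"])
        ET.SubElement(doc, "text").text = report["text"]
        # Emit one <entity> element per (type, name) pair, grouped by type.
        for entity_type, names in sorted(report["entities"].items()):
            for name in names:
                ET.SubElement(doc, "entity", type=entity_type).text = name
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = [{
        "id": "report-1",
        "text": "Placeholder report text.",
        "entities": {"person": ["Cesar Gil"], "organization": ["Global Ways"]},
    }]
    print(reports_to_xml(sample))
```

Keeping the entity type as an attribute makes it straightforward for a reader of the file to group entities by type, which is how the views described later color them.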

 

Figure 1: View of the workstation configuration for our investigations with Jigsaw.  Having so many pixels to work with is a big advantage.

 

Surprisingly, there was relatively little in the way of connections across entities in the documents.  After about 6 or 7 hours of exploration, we had no solid leads, just many, many possibilities.  So we went back, and some of us read sets of reports that we hadn’t looked at before.  At that point, we began to identify some potentially “interesting” activities.  What was clear here was that the time we spent exploring the documents in Jigsaw was not wasted.  It helped us become more familiar with many different things going on in the reports.  Thus, new, more deliberate examinations and readings of the documents began to turn up more promising leads.  We began to find connections across some actors and organizations in the data set.

 

We were curious, however, why those connections did not show up in Jigsaw initially.  Upon returning to the system, we learned why.  Some of the key entities in the plot we uncovered (r’Bear, Madhi Kim, Global Ways, Cesar Gil, etc.) were either identified as entities in only some of the documents in which they appeared or they were not identified as entities at all.  Jigsaw can only visualize the document and entity information that it has to work with, so there was nothing for us to observe (connections-wise) in our first use of the system on the problem.

 

At this point, we decided that we needed to update the entity information across the document collection.  We started with the pre-identified entities and wrote some programs that scanned all the text documents and identified places where these entities had simply been missed.  This process added more than 8000 new entity-to-document matches over the whole collection, and the entity connection network became much denser.  The drawback of this technique was that we also added more noise by multiplying unimportant or wrongly extracted entities.  Therefore, we manually checked the most frequent entities and made a list of false positive entities (wrongly classified or extracted) for each entity type.  We excluded these entities from the document collection, and we manually added previously unidentified entities that we noticed while reading the documents.  We also removed the report date from the list of date entities for a document.  Instead, we stored it as a special publication date field for the document.  This whole process provided us with a consistent connection network that was mostly cleaned of false positives.  Since only one quarter of the entities across the entire collection appeared in more than one report, we added an option in Jigsaw that allows the user to filter out all entities that appear in only one report.  Doing so allows the user to focus on highly connected entities at the beginning of the investigation and to add further entities when more specific questions arise later during the analysis.
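The entity update and the single-report filter described above can be sketched roughly as follows.  This is an illustration under stated assumptions, not the programs we wrote: the function names, the whole-phrase matching rule, and the representation of the false positive list as a simple set of names are all hypothetical.

```python
import re
from collections import defaultdict


def add_missed_matches(documents, known_entities, false_positives=frozenset()):
    """Scan every document for every known entity name.

    `documents` maps a document id to its text.  Returns a mapping from
    entity name to the set of document ids in which the name appears.
    Names in `false_positives` (assumed to be a manually curated set)
    are skipped entirely.
    """
    index = defaultdict(set)
    for name in known_entities:
        if name in false_positives:
            continue
        # Match the name only as a whole phrase, not inside a longer word.
        pattern = re.compile(r"(?<!\w)" + re.escape(name) + r"(?!\w)")
        for doc_id, text in documents.items():
            if pattern.search(text):
                index[name].add(doc_id)
    return dict(index)


def filter_single_report_entities(index):
    """Keep only the entities that appear in more than one document."""
    return {name: docs for name, docs in index.items() if len(docs) > 1}


if __name__ == "__main__":
    docs = {
        "d1": "Cesar Gil met staff of Global Ways.",
        "d2": "Global Ways shipped goods.",
    }
    matches = add_missed_matches(docs, ["Cesar Gil", "Global Ways"])
    print(filter_single_report_entities(matches))
```

Exact string matching of this kind is what produces the extra noise mentioned above, which is why a manual pass over the most frequent entities remains necessary.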

 

Next we resumed exploring the documents using Jigsaw.  Now, it was much easier for us to track down different plot threads and explore relationships between actors and events.  Figure 2 shows the main window of Jigsaw that allows the analyst to query for entities, substrings of entities, or to search for words/expressions in documents.  It also shows the color scheme that is used in the graph and text views to encode entity types.  (For all our figures, click on the image on this page to reveal a larger figure that is more readable.)

 

Figure 2: Jigsaw main window.

 

On our second read of the news reports, we noticed one mentioning the rapper r’Bear being taken to the hospital with bumps on his face.  This seemed suspicious, so we explored r’Bear in Jigsaw’s graph visualization, shown below in Figure 3.  Documents are the larger white circles and the different types of entities are the smaller colored circles.  By expanding the reports that mention r’Bear, many other “interesting” entities surface, such as Shravaana and Madhi Kim.

 

Figure 3: Graph view begun by loading r’Bear, then showing connecting documents and expanding those documents to show included entities.

 

Next we would turn to the text view (shown below in Figure 4) and examine these reports.  In our text view, the entities are highlighted.  We cannot overstate how important it is to simply read the reports carefully.  What Jigsaw helps with is identifying a small subset of reports on related topics that can be examined carefully.  By looking at the reports about r’Bear, we noticed the connections to Luella Vedric.

 

Figure 4: The set of reports relevant to r’Bear with one in focus showing the document text and identified entities.

 

Below in Figure 5, we started with a search on Luella Vedric and then expanded the documents in which she appears to show the entities also appearing in those reports.  Double-clicking on an entity such as Vedric makes the connecting documents appear; double-clicking on those documents then draws out their contained entities around the document.

 

Figure 5: Exploration starting with Luella Vedric and exploring the documents in which she appears.

 

 

We found Vedric’s connections to Catherine (Collie) Carnes and examined the text reports about her.  This is where we noted the mention of the Assan Circus (shown below in Figure 6) which led to further investigations.  By exploring the entity “Assan” we found reports mentioning the Abdul Hassan alias.  Manual exploration of the importer/exporter spreadsheet file found the connection between Hassan and Global Ways.

 

Reading the reports about Vedric also made us notice the mention of a musician “r’Bert”, who we presume is r’Bear but was simply incorrectly reported or documented.

 

Figure 6: Report with Vedric that mentions friend Carnes and refers to the Assan circus.

 

Carnes was also mentioned in a report with Faron Gardner, so we investigated him too.  In Figure 7 below, Jigsaw’s List view is shown.  Here we have selected Gardner and Cesar Gil (highlighted in yellow), and we note that they are connected with many of the same entities; shown here are places and organizations.  We made the blog texts into documents and imported them into Jigsaw as well.  By examining these views and simply reading the blog, we noted that Cesar Gil was the chinshopes individual, and we found the connections between Cesar, Collie and Faron, which are mentioned in his blog.

 

Figure 7: Jigsaw’s List view showing connections between Cesar Gil and Faron Gardner.

 

 

At various times in the investigation, we wanted to get a handle on the chronology of events we were focusing on.  Jigsaw’s timeline view, shown below in Figure 8, shows a report as a tower of entities positioned at its correct point (publication date) on the timeline.  To the right is the focus view on one particular report.  By sweeping out a region in one timeline (shown here in dark yellow), that portion of the timeline is reproduced on the next timeline up in more detail.  In the figure below this has been done twice.

 

 

Figure 8: Jigsaw’s timeline view.  This view shows some of the events involving r’Bear and Madhi Kim.

 

One technique we used a great deal in our investigations with Jigsaw was to gather a large set of potentially “interesting” reports into the graph view and then expand all the reports to show all their entities.  Next, by clicking the “Do Layout” button in the upper left, all these reports are drawn out along a circle in the view.  Entities connecting to only one report are drawn outside the circle, and entities connecting to more than one report are drawn inside.  Thus the set of entities inside the circle shows a kind of interconnected network of entities that should be examined much more closely.  By clicking on one of these entities and selecting it, the documents in which it appears are brought into one of Jigsaw’s text views (shown earlier), where they can be read carefully.  Figure 9 below shows such a set of interesting reports for the contest data.  Note the entities on the inside, many of which are involved in the solution we propose.
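The idea behind this layout can be sketched in a few lines.  This is only an illustration of the inside/outside rule described above, not Jigsaw's implementation: the function name, the even spacing of reports around the circle, and the input format (a mapping from report id to the set of entities it contains) are assumptions.

```python
import math
from collections import defaultdict


def circle_layout(report_entities, radius=1.0):
    """Place reports evenly on a circle and split entities into two groups.

    `report_entities` maps a report id to the set of entities it contains.
    Returns (positions, inside, outside): `positions` maps each report id
    to an (x, y) point on the circle, `inside` lists the entities found in
    more than one report, and `outside` lists those found in exactly one.
    """
    reports = sorted(report_entities)
    positions = {}
    for i, report in enumerate(reports):
        angle = 2 * math.pi * i / len(reports)
        positions[report] = (radius * math.cos(angle), radius * math.sin(angle))

    # Invert the mapping: entity -> set of reports that mention it.
    mentions = defaultdict(set)
    for report, entities in report_entities.items():
        for entity in entities:
            mentions[entity].add(report)

    inside = sorted(e for e, docs in mentions.items() if len(docs) > 1)
    outside = sorted(e for e, docs in mentions.items() if len(docs) == 1)
    return positions, inside, outside
```

The entities returned in the inside group correspond to the interconnected network that the view draws in the middle of the circle.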

 

Figure 9: Use of the “Do Layout” command in the graph view.  All entities connecting to more than one document are drawn in the middle making it easier to focus on them.

 

Below in Figure 10 is a final graph view where we have filtered out all but the most important entities and documents with respect to our solution and we have carefully positioned the different reports and entities to make their connections a little more clear.  So this really is more of a documenting or explanatory view, not one that we would encounter during investigation.

 

Figure 10: A final cleaned-up view that could be used as documentation helping to tell the analysis story of this investigation.

 

Again, we cannot emphasize strongly enough how important the process of carefully reading the reports is.  Obviously, the problem with the contest data is that there are over 1500 reports.  Jigsaw is very helpful for exploring different entities in its graphical views and then having it load a small subset of the relevant documents in one of its text views.  We frequently found ourselves exploring different entities and we would have 4 or 5 different Jigsaw text views open, each with only a few documents inside.  We could then carefully examine those reports and it was easy to understand the connections between entities and how the pieces began to fit together.

 

Working in this way also underlined the importance of one aspect of our exploration environment: the four displays on which we ran the system.  We simply need many pixels to spread out all the different document views.  Performing this exploration on one display would be extremely slow and burdensome because it would require so much window flipping.

 

Our analysis activities exposed a number of shortcomings in the Jigsaw system and thus the activities functioned very much in a formative evaluation sense.  We made a number of changes to each view in our system as we were working on the contest.  Probably the key missing feature in the system at this time is the ability to identify or remove entities while running the system and doing active investigations.  We plan to add that capability soon.

 

         
