(extracted from the VAST 2007 contest entry submitted
by Georgia Tech)
Our system Jigsaw does not have capabilities for
finding themes or concepts in a document collection. Instead, it acts more as a visual index,
helping to show which documents are connected to each other and which are
relevant to a line of investigation being pursued. Consequently, we began working on the problem
by dividing the news report collection into four pieces (for the four people on
our team doing the investigation). Each
of us skimmed the 350+ reports in our own unique subset just to become familiar
with general themes discussed in those documents. We also jotted down notes about potential
people, organizations or events to study further.
Next, we came together and used Jigsaw to examine the
entire news report collection. Jigsaw
expects an XML file as input that identifies the unique documents and
the entities in those documents. We wrote a
translator that converts the text reports and the pre-identified entities
from the contest data set into the XML form that Jigsaw can read. We then ran Jigsaw and explored a number of
the potential leads that we had each identified during our initial skim of the
reports. What we looked for at first
were connections across entities: essentially, the same people, organizations or
incidents being discussed in multiple reports.
Jigsaw provides multiple views of the documents and entities so it is
extremely advantageous to have a lot of screen real estate. In Figure 1 below, we show the workstation
where we conducted our investigations.
It has four monitors.
Figure 1: View of the workstation configuration for our
investigations with Jigsaw. Having so
many pixels to work with is a big advantage.
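A translation step of the kind described above might look roughly like the sketch below. It is illustrative only and is not the translator used for the contest: the element and attribute names (documents, document, text, entity, type) and the shape of the input dictionaries are assumptions, since the XML schema that Jigsaw reads is not reproduced in this write-up.

```python
import xml.etree.ElementTree as ET


def reports_to_xml(reports):
    """Build one XML string from a list of report dicts.

    Each report is assumed to look like:
      {"id": "r1", "text": "...", "entities": [("person", "Luella Vedric")]}
    All tag and attribute names are hypothetical placeholders.
    """
    root = ET.Element("documents")
    for report in reports:
        doc = ET.SubElement(root, "document", id=report["id"])
        ET.SubElement(doc, "text").text = report["text"]
        for entity_type, name in report["entities"]:
            entity = ET.SubElement(doc, "entity", type=entity_type)
            entity.text = name
    # encoding="unicode" returns a str; special characters are escaped.
    return ET.tostring(root, encoding="unicode")
```

Building the tree with ElementTree rather than concatenating strings means characters such as & or < in the report text are escaped automatically.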
Surprisingly, there was relatively little in the way
of connections across entities in the documents. After about 6 or 7 hours of exploration, we
really had no solid leads, just many, many possibilities. So we went back and some of us read sets of
reports that we hadn’t looked at before.
At that point, we began to identify some potential “interesting”
activities. It was clear, however, that
the time we spent exploring the documents in Jigsaw was not wasted. It helped us become more familiar with many
different things going on in the reports.
Thus, new, more deliberate examinations and readings of the documents
began to turn up more promising leads.
We began to find connections across some actors and organizations in the
data set.
We were curious, however, why those connections did
not show up in Jigsaw initially. Upon
returning to the system, we learned why.
Some of the key entities in the plot we uncovered (r’Bear, Madhi Kim,
At this point, we decided that we needed to update
the entity information across the document collection. We started with the pre-identified entities
and we wrote some programs that would scan all the text documents and identify
places where these entities simply were missed.
This process added more than 8000 new entity-to-document
matches over the whole collection, and the entity connection network became much
denser. The drawback of this
technique was that we also added more noise by multiplying unimportant or
wrongly extracted entities. Therefore,
we manually checked the most frequent entities for validation and made a list
of false positive entities (wrongly classified or extracted) for each entity
type. We excluded these entities from
the document collection and we manually added previously unidentified entities
that we noticed while reading the documents.
We also removed the report date from the list of date entities for a
document. Instead, we stored it as a special
publication date field for the document.
This whole process gave us a consistent connection network that
was largely free of false positives.
Since only one quarter of the entities across the entire collection
appeared in more than one report, we added an option in Jigsaw that allows the
user to filter out all entities that appear in only one report. Doing so allows the user to focus on highly
connected entities at the beginning of the investigation and to add further
entities when more specific questions arise later during the analysis.
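The rescanning and filtering steps above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the programs used for the contest: the function names, the dictionary-based data layout and the whole-phrase, case-insensitive matching rule are all hypothetical choices.

```python
import re
from collections import defaultdict


def rescan_entities(documents, known_entities, excluded=frozenset()):
    """Return {entity: set of doc ids} by searching every document text.

    documents:      {doc_id: text}
    known_entities: entity strings already identified somewhere
    excluded:       entities judged to be false positives after manual review
    """
    matches = defaultdict(set)
    for entity in known_entities:
        if entity in excluded:
            continue
        # Whole-phrase, case-insensitive match. Lookarounds are used
        # instead of \b so names that start or end with punctuation work.
        pattern = re.compile(
            r"(?<!\w)" + re.escape(entity) + r"(?!\w)", re.IGNORECASE
        )
        for doc_id, text in documents.items():
            if pattern.search(text):
                matches[entity].add(doc_id)
    return dict(matches)


def drop_single_report_entities(matches):
    """Keep only entities that appear in more than one report."""
    return {e: docs for e, docs in matches.items() if len(docs) > 1}
```

A simple string match like this is exactly what introduces the noise mentioned above, which is why the manually built exclusion list is passed in as a separate argument.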
Next we resumed exploring the documents using
Jigsaw. Now, it was much easier for us
to track down different plot threads and explore relationships between actors
and events. Figure 2 shows the main window
of Jigsaw that allows the analyst to query for entities, substrings of entities,
or to search for words/expressions in documents. It also shows the color scheme that is used
in the graph and text views to encode entity types.
Figure 2: Jigsaw main window.
On our second read of the news reports, we noticed
one mentioning the rapper r’Bear being taken to the hospital with bumps on his
face. This seemed suspicious, so we explored
r’Bear in Jigsaw’s graph visualization.
Figure 3 below shows this exploration.
Documents are the larger white circles and the different types of
entities are the smaller colored circles.
Expanding the reports that mention r’Bear surfaces many other “interesting”
entities, such as Shravaana and Madhi Kim.
Figure 3: Graph view begun by loading r’Bear, then
showing connecting documents and expanding those documents to show included
entities.
Next we would turn to the text view (shown below in
Figure 4) and examine these reports. In
our text view, the entities are highlighted.
We cannot overstate how important it is to simply read the reports
carefully. Where Jigsaw helps
is in identifying a small subset of reports on related topics that can be examined
carefully. By looking at the reports
about r’Bear, we noticed the connections to Luella Vedric.
Figure 4: The set of reports relevant to r’Bear with one
in focus showing the document text and identified entities.
Below in Figure 5, we started with a search on Luella
Vedric and then we expanded the documents in which she appears to show the
entities also appearing in those reports.
Double-clicking on an entity such as Vedric makes the connecting
documents appear; double-clicking on those documents then draws out their
contained entities around each document.
Figure 5: Exploration starting with Luella Vedric and
exploring the documents in which she appears.
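The expand-by-double-click interaction amounts to walking a two-way index between documents and entities. The sketch below illustrates that idea in Python; the class and method names are hypothetical and say nothing about how Jigsaw stores its data internally.

```python
from collections import defaultdict


class EntityDocumentIndex:
    """Two-way index between documents and the entities they contain."""

    def __init__(self, doc_entities):
        # doc_entities: {doc_id: iterable of entity names}
        self.entities_of = {d: set(es) for d, es in doc_entities.items()}
        self.docs_of = defaultdict(set)
        for doc_id, entities in self.entities_of.items():
            for entity in entities:
                self.docs_of[entity].add(doc_id)

    def expand_entity(self, entity):
        """Documents that mention the entity (first double-click)."""
        return set(self.docs_of.get(entity, set()))

    def expand_document(self, doc_id):
        """Entities contained in the document (second double-click)."""
        return set(self.entities_of.get(doc_id, set()))

    def co_occurring(self, entity):
        """Entities sharing at least one document with the given entity."""
        found = set()
        for doc_id in self.expand_entity(entity):
            found |= self.expand_document(doc_id)
        found.discard(entity)
        return found
```

Starting from one entity and calling co_occurring corresponds to the two expansion steps described above: first to the documents, then to everything else those documents mention.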
We found Vedric’s connections to Catherine (Collie)
Carnes and examined the text reports about her.
This is where we noted the mention of the Assan Circus (shown below in
Figure 6) which led to further investigations.
By exploring the entity “Assan” we found reports mentioning the Abdul
Hassan alias. Manual exploration of the
importer/exporter spreadsheet file found the connection between Hassan and
Reading the reports about Vedric also made us notice
the mention of a musician, “r’Bert”, whom we presume to be r’Bear, simply
reported or documented incorrectly.
Figure 6: Report with Vedric that mentions friend Carnes
and refers to the Assan circus.
Carnes was also mentioned in a report with Faron
Gardner, so we investigated him too. In
Figure 7 below, Jigsaw’s List view is shown.
Here we have selected
Figure 7: Jigsaw’s List view showing connections between Cesar
Gil and Faron Gardner.
At various times in the investigation, we wanted to
get a handle on the chronology of events we were focusing on. Jigsaw’s timeline view, shown below in Figure
8, shows each report as a tower of entities positioned at its publication date
on the timeline. To
the right is the focus view on one particular report. By sweeping out a region in one timeline
(shown here in dark yellow), that portion of the timeline is reproduced on the
next timeline up in more detail. In the
figure below this has been done twice.
Figure 8: Jigsaw’s timeline view. This view shows some of the events involving
r’Bear and Madhi Kim.
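The data behind such a timeline can be sketched as a date-ordered list of reports, with an optional date range standing in for the swept-out region. This is only an illustration in Python: the dictionary keys ("id", "published", "entities") and the function name are assumptions, and no drawing is attempted.

```python
from datetime import date


def timeline_towers(reports, start=None, end=None):
    """Return (publication_date, doc_id, entities) tuples sorted by date.

    reports:   iterable of dicts with hypothetical keys "id",
               "published" (a datetime.date) and "entities"
    start/end: optional inclusive bounds for the region to show in detail
    """
    selected = [
        r for r in reports
        if (start is None or r["published"] >= start)
        and (end is None or r["published"] <= end)
    ]
    # Sort by date, then by id so reports on the same day have a stable order.
    selected.sort(key=lambda r: (r["published"], r["id"]))
    return [(r["published"], r["id"], sorted(r["entities"])) for r in selected]
```

Calling the function again on a narrower range mirrors sweeping out a region twice to get successively more detailed timelines.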
One technique we used a great deal in our investigations
with Jigsaw was to gather a large set of potentially “interesting” reports into
the graph view and then expand all the reports to show all their entities. Next, by clicking the “Do Layout” button in the
upper left, all these reports are drawn out along a circle in the view. Entities connecting to only one report are
drawn outside the circle, and entities connecting to more than one report are
drawn inside. Thus the set of entities
inside the circle shows a kind of interconnected network of entities that
should be examined much more closely. By
clicking on one of these entities and selecting it, the documents in which it
appears will be brought into one of Jigsaw’s text views (shown earlier) and they
can be read carefully. Figure 9 below
shows such a set of interesting reports for the contest data. Note the entities on the inside, many of
which are involved in the solution we propose.
Figure 9: Use of the “Do Layout” command in the graph
view. All entities connecting to more
than one document are drawn in the middle making it easier to focus on them.
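The layout rule just described (reports on a circle, single-report entities outside, shared entities inside) can be sketched in a few lines of Python. This is an assumed reconstruction of the idea, not Jigsaw's layout code: the radii are arbitrary, shared entities are simply placed toward the centroid of their reports, and document ids are assumed not to clash with entity names.

```python
import math


def circular_layout(doc_entities, radius=1.0, inner=0.5, outer=1.4):
    """Return {node: (x, y)} for documents and entities.

    doc_entities: {doc_id: iterable of entity names}
    """
    docs = sorted(doc_entities)
    angle = {d: 2 * math.pi * i / len(docs) for i, d in enumerate(docs)}
    pos = {
        d: (radius * math.cos(a), radius * math.sin(a))
        for d, a in angle.items()
    }

    # Invert the mapping: which documents does each entity appear in?
    docs_of = {}
    for d in docs:
        for e in set(doc_entities[d]):
            docs_of.setdefault(e, []).append(d)

    for e, owners in docs_of.items():
        if len(owners) == 1:
            # Single-report entity: push it outside the circle next to
            # its report (entities of one report overlap in this sketch).
            a = angle[owners[0]]
            pos[e] = (outer * math.cos(a), outer * math.sin(a))
        else:
            # Shared entity: pull it inside, toward its reports' centroid.
            cx = sum(pos[d][0] for d in owners) / len(owners)
            cy = sum(pos[d][1] for d in owners) / len(owners)
            pos[e] = (inner * cx, inner * cy)
    return pos
```

Because the centroid of points on a circle never lies outside it, scaling by a factor below one guarantees that every shared entity ends up strictly inside the ring of reports.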
Below in Figure 10 is a final graph view where we
have filtered out all but the most important entities and documents with respect
to our solution and we have carefully positioned the different reports and
entities to make their connections a little clearer. This is therefore more of a documentation or
explanatory view than one we would encounter during an investigation.
Figure 10: A final cleaned-up view that
could be used as documentation helping to tell the analysis story of this
investigation.
Again, we cannot emphasize strongly enough how
important the process of carefully reading the reports is. Obviously, the problem with the contest data
is that there are over 1500 reports.
Jigsaw is very helpful for exploring different entities in its graphical
views and then having it load a small subset of the relevant documents in one
of its text views. We frequently found
ourselves exploring different entities and we would have 4 or 5 different
Jigsaw text views open, each with only a few documents inside. We could then carefully examine those reports
and it was easy to understand the connections between entities and how the
pieces began to fit together.
Working in this way also underlined the importance
of our exploration environment, in particular the four displays on which we ran the
system. We simply needed many pixels to
spread out all the different document views.
Performing this exploration on one display would be extremely slow and
burdensome because it would require so much window flipping.
Our analysis activities exposed a number of
shortcomings in the Jigsaw system, and thus the activities functioned very much
as a formative evaluation. We made
a number of changes to each view in our system as we were working on the
contest. Probably the key missing
feature in the system at this time is the ability to identify or remove
entities while running the system and doing active investigations. We plan to add that capability soon.