Q1. Is this data set similar to those given to real-life analysts?
A1. The easy answer to this question is "yes" and "no". Real-life analysts are given anything from hundreds of thousands of network traffic records, to photographs and Excel spreadsheets, to forensic ballistic evidence. Much depends on what question the analyst is being asked to answer, and on the mission of the agency for which the analyst works. The data set with which you are working is a realistic one, and contains all manner of information, important and not, through which a "real life" analyst would have to sift in order to provide a hypothesis to a decision-maker. The plot of the dataset isn't one you might find on the front page of the Washington Post, but then stranger things have happened!
Q2. We're tool builders, not analysts. Can we participate without an analyst on our team?
A2. Yes! When it comes down to it, analysis is simply applying a systematic, logical thinking methodology, much like using the "scientific method" for an experiment. There are standard, well-documented analysis methods that you could reference and use in your efforts, for example: Link Analysis, Analysis of Competing Hypotheses, Social Network Analysis.
If you are interested in learning more, there are also a couple of books you might want to look at:
Richards J. Heuer, Jr., Psychology of Intelligence Analysis
Morgan D. Jones, The Thinker's Toolkit: 14 Powerful Techniques for Problem Solving
A nice article about subjective thinking and competing hypotheses is at:
http://www.dodccrp.org/events/10th_ICCRTS/CD/papers/126.pdf (or our local copy of the paper)
or see the presentations at:
Another site to explore:
BUT remember that we are interested in new approaches and ideas as well! Surprise us… What really matters here is answering the questions.
Q3. I found some interesting anomalies in the dataset. I should report these, right?
A3. Possibly -- only report data anomalies if they are relevant to your hypotheses and/or conclusions. For example, if you found that the days of the news articles are only Monday, Wednesday, and Friday, that might be an anomaly when just considering data, but you shouldn't report that unless you can find a link to the scenario. However, if you found that stories about "Sam" were always associated with Merino sheep and Merino sheep play a part in your evolving hypotheses, then you should report this in some way.
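As a rough illustration of the weekday example above, the following minimal Python sketch tallies publication dates by day of the week and lists the days on which nothing was published. The dates are invented for illustration and are not taken from the contest dataset, and the names weekday_counts and sample_dates are hypothetical. A check like this only surfaces the pattern; whether it is worth reporting still depends on whether it links to your hypotheses.

```python
from collections import Counter
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def weekday_counts(dates):
    """Count how many dates fall on each day of the week."""
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return Counter(DAY_NAMES[d.weekday()] for d in dates)

# Hypothetical publication dates, invented for illustration only.
sample_dates = [
    date(2007, 3, 5),   # a Monday
    date(2007, 3, 7),   # a Wednesday
    date(2007, 3, 9),   # a Friday
    date(2007, 3, 12),  # a Monday
]

counts = weekday_counts(sample_dates)
# Days of the week on which no article was published.
missing_days = [day for day in DAY_NAMES if counts[day] == 0]

print(dict(counts))
print(missing_days)
```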
Q4. Could you move the deadline to August? That would allow us to have a summer intern to help out.
A4. Unfortunately, the July deadline IS firm, as we need to determine who will participate in the live session and also collect the camera-ready materials in time.
Q5. How much are you allowing teams to "build a tool to fit this data" – i.e. we could wind up building a tool incrementally, trying to solve the problem as we went through it and bringing in custom pieces as needed, even building more special tools to present the "answer" afterwards. I wasn't sure if that was allowed. Some contests (like TREC) are not run that way (e.g. TREC is more "run your best existing system on this test data", and you're not allowed to look at the test data). In that case we definitely couldn't participate, as we'd have to pull in a lot of things to extend our system to work with this type of unstructured data.
A5. (Updated 07/2007) Your approach is fine for the contest, but keep in mind that if your tool does well and is selected for the live event at the symposium, it will be used for a different – simpler but similar – problem. The live event at the symposium is not a contest but an opportunity for top-scoring teams to get feedback from analysts. Teams will be given the data only an hour or two prior to the event. Each team will consist of one or two members from the tool builder team and a professional analyst, working together to assess as much as possible of a new situation in a few hours. It is important that your contest submission describe the process by which you arrived at your answers and identify whether your success can occur only for this specific dataset (in other words, does your process generalize?). If you feel it does, then that's great.
Q6. Can PNNL employees participate?
A6. PNNL employees cannot participate if their team is not clearly separated from the group which created the dataset. If the separation is clear, the team can submit their entry to the contest but will not be considered for a prize (i.e. they will be “hors concours”).
Q7. Are the teams who use the pre-processed data judged separately from the teams who use only the raw data? If not, how do the criteria for each entry type differ?
A7. The teams who use the pre-processed data will be judged in a separate category from those teams who use the raw data and do their own processing. The criteria, however, will be the same for each type of entry.
Q8. What tool did you use for the preprocessing?
A8. We used MITRE’s Alembic with some modifications and some manual editing. See http://www.mitre.org/tech/alembic-workbench for details on Alembic.
Q9. Was metadata extracted from the pictures (and provided in the preprocessed data set)?
A9. No, we only preprocessed the text, so you have to look at the pictures yourselves and add that information.
Questions? Email the Contest Chairs