When You're Hot, You're Hot -- And the Computer Knows It

By Gabe Goldberg, HCIL Media Fellow

We've all got favorite authors, books, movies, and music. But while we
can easily distinguish our likes from dislikes, it's sometimes hard to
explain our choices to others, or even to understand them ourselves!
And many people's picks are eclectic and instinctual, lacking unifying
themes.

So imagine the difficulties scholars face when interpreting literary works.
The good news is that current digital libraries make abundant material
(mostly documents but including some images and multimedia) available
for access, search and retrieval. Text mining and machine learning are
increasingly common, applying classification techniques in areas such as
industry, intelligence, defense, law enforcement, and the sciences. But
computers don't yet contribute much to humanities scholars' basic
mission: critical interpretation.

This gap may be explained by informal observations, which reveal two
main categories of potential technology users among literary scholars.
The first comprises the relatively few enthusiastic computer users who
avidly adopt new technology. The second, a broader base of scholars less
interested in computational tools, is unlikely to use online analysis
tools unless they offer user-friendly interfaces and are shown to
benefit their peers.

But the new area of literary text mining isn't quite like seeking gold
or drilling for oil. Though common literary text-mining tasks can be
modeled as classification problems such as authorship attribution,
stylistic analysis, and genre analysis, literary text classification
tasks are broader than simple topic spotting. Scholars must assign
labels such as topics, styles, genres, authors, eras, and other literary
and historical concepts that combine basic categories.

To apply technology in this area, the new Nora Project
(www.noraproject.org) unites multidisciplinary teams from five
institutions and multiple domains, including the humanities, information
science, and computer science. Nora aims to develop an architecture and
tools for non-technical literary scholars to employ text mining,
starting with some five gigabytes of 18th- and 19th-century British and
American literature.

A specific realistic problem was needed to simultaneously guide the
project's design and engage scholars and students in the investigation.
A collection of about 300 letters from poet Emily Dickinson to her
sister-in-law, Susan Huntington (Gilbert) Dickinson, was selected for
analysis, to address a perennial debate: what constitutes the erotic in
Dickinson's writings.

To begin, a Dickinson expert rated documents as "hot" or "not hot". The
former label was shorthand for anything considered erotic, that is,
being flirtatious or seductive, having sexual connotations, or aiming to
pull in the addressee with attention to the physical, even arousal. This
labeling provided a baseline for evaluating classification algorithms;
experiments then tested the data mining techniques' accuracy.
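
To make the idea of an expert baseline concrete, here is a minimal sketch in Python of how a classifier's output might be scored against expert ratings. The article does not say how accuracy was actually measured, so the simple agreement rate used here is an assumption, and the function name and the label lists are hypothetical placeholders rather than project data.

```python
def accuracy(expert, predicted):
    """Fraction of documents where the prediction matches the expert rating."""
    if not expert or len(expert) != len(predicted):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(e == p for e, p in zip(expert, predicted))
    return matches / len(expert)

# Hypothetical labels for five documents (placeholders, not real ratings).
expert_labels = ["hot", "not", "hot", "not", "not"]
predicted_labels = ["hot", "not", "not", "not", "hot"]

score = accuracy(expert_labels, predicted_labels)
print(f"Agreement with expert: {score:.0%}")
```

A single agreement rate hides which kind of mistake was made; a fuller evaluation would likely count missed hot documents and false alarms separately.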

The expert was interviewed about her initial beliefs regarding
indicators of eroticism in the letters. These included certain words,
rhetoric of similarity (e.g., "Each to Each"), and document mutilations
such as erasures, scissorings, or ink-overs.

To be effective, a text mining system must be trained by its expert
user, rather than imposing its requirements on the user. In the Dickinson
project, users are first presented with a list of documents.
Manually rated documents are assigned one of five colored-circle values
ranging from red (hot) to black (not). A representative sample (15 hot
and 15 not-hot) is then used as a training set by the automatic data
mining classifier. Documents not rated manually are assigned colored
squares, each representing the likelihood that the document is hot. The
expert can then browse documents and accept or reject the
machine-suggested classification. Documents can be sorted by predicted
hot-ness to review the extreme ends of the spectrum. As with all data
mining techniques, classification quality depends strongly on the
accuracy of the training set, so users are encouraged to re-run the
prediction after validating additional ratings. After the first
classification, a user can review words suggested as meaningful
indicators.
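
The train, predict, sort, and review loop described above can be sketched in a few lines of Python. This is an illustration only: the article does not name the classifier used, so a small Naive Bayes model with add-one smoothing stands in as an assumption, and every function name and sample string below is a hypothetical placeholder, not text from the Dickinson letters.

```python
import math
from collections import Counter

PUNCT = ".,;:!?\"'()"

def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]

def train(labeled_docs):
    """Count words per label from (text, label) pairs; label is 'hot' or 'not'."""
    word_counts = {"hot": Counter(), "not": Counter()}
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["hot"]) | set(word_counts["not"])
    return word_counts, doc_counts, vocab

def word_log_prob(word, label, model):
    """Log probability of a word under a label, with add-one smoothing."""
    word_counts, _, vocab = model
    total = sum(word_counts[label].values())
    return math.log((word_counts[label][word] + 1) / (total + len(vocab)))

def prob_hot(text, model):
    """Estimated likelihood that a document is hot (words unseen in training are ignored)."""
    _, doc_counts, vocab = model
    n_docs = sum(doc_counts.values())
    scores = {}
    for label in ("hot", "not"):
        score = math.log(doc_counts[label] / n_docs)
        for word in tokenize(text):
            if word in vocab:
                score += word_log_prob(word, label, model)
        scores[label] = score
    top = max(scores.values())  # subtract the max before exponentiating
    e_hot = math.exp(scores["hot"] - top)
    e_not = math.exp(scores["not"] - top)
    return e_hot / (e_hot + e_not)

def indicator_words(model, k=2):
    """Words whose smoothed log-odds most favor the 'hot' label."""
    vocab = model[2]
    def log_odds(w):
        return word_log_prob(w, "hot", model) - word_log_prob(w, "not", model)
    return sorted(vocab, key=log_odds, reverse=True)[:k]

# Hypothetical expert-rated training set (invented placeholder sentences).
training = [
    ("sweet kiss dear heart", "hot"),
    ("kiss me sweet darling", "hot"),
    ("the garden needs rain", "not"),
    ("rain fell on the garden today", "not"),
]
# Hypothetical documents the expert has not rated yet.
unrated = {"A": "a sweet kiss for you", "B": "rain in the garden"}

model = train(training)
likelihoods = {doc_id: prob_hot(text, model) for doc_id, text in unrated.items()}
ranked = sorted(likelihoods, key=likelihoods.get, reverse=True)

for doc_id in ranked:
    print(doc_id, round(likelihoods[doc_id], 2))
print("Suggested indicators:", indicator_words(model))
```

In a workflow like the one described, the expert would accept or reject each suggestion, add the validated ratings to the training list, and call `train` again to refresh the predictions.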

After the experiment, the Dickinson expert felt that the data mining
results shed new light on the texts examined despite her having already
studied them extensively. The early evaluation confirmed that the basic
user interface is usable by non-specialists -- a happy improvement on
most text-mining tools, whose users require assistance from computer
experts.

