Bio-Informatics Visualization Workshop

Human-Computer Interaction Lab,
University of Maryland, College Park

May 30, 2002, 9:30 AM - 4:30 PM
A.V. Williams Building

 ** WORKSHOP IS FULL ** 

Organizers

Eric Baehrecke, University of Maryland Biotechnology Institute baehreck@umbi.umd.edu
Ben Shneiderman, UMCP Department of Computer Science ben@cs.umd.edu

This workshop (expanded to include 60-70 participants, 14 speakers) will present current implementations and challenges for researchers seeking to understand genomic and proteomic data by applying advanced visualization techniques. Current topics include DNA microarray chip data, genomic sequences, protein structures, gene ontologies, and biological pathways. Visualization methods include 2-d and 3-d scattergrams, color-coded heatmaps, hierarchical treemaps, temporal data searching, topographic displays, and hierarchical clustering presentations. Related data mining techniques that are enhanced by visualization include supervised and unsupervised classification/categorization, principal components analysis, and multi-dimensional scaling. Biologically relevant tasks include comparing samples, identifying similar and different genes, identifying targets, defining pathways, and generating hypotheses.

Sponsors

Celera
IBM
Spotfire
University of Maryland Biotechnology Institute

Participation

Potential attendees should request permission to participate by sending an email to Ben Shneiderman (ben@cs.umd.edu) identifying their background and interest in the topic by April 28. Responses will be made by May 3.

Students interested in attending the workshop should apply by sending an email to Jinwook Seo (jinwook@cs.umd.edu) describing their academic standing, explaining their research, and giving the reasons they want to attend. Depending on space availability, free student registrations will be announced by May 3.

University of Maryland Participants

Harry Hochheiser, hsh@cs.umd.edu
Jinwook Seo, jinwook@cs.umd.edu
Amitabh Varshney, varshney@cs.umd.edu
 

Sponsor Participants

Peter W. Li, Celera, peter.li@celera.com
Russell Turner, Celera
Tanveer Syeda-Mahmood, IBM Life Sciences Solutions, stf@almaden.ibm.com
Bernice Rogowitz, IBM, rogowtz@watson.ibm.com
Christopher Ahlberg, Spotfire, ahlberg@spotfire.com
 

Confirmed Presenters

Allan Kuchinsky, Agilent Corp., allan_kuchinsky@agilent.com
Annette Adler, Agilent Corp., annette_adler@agilent.com
Owen White, TIGR, owhite@tigr.org
Daniel B. Carr, George Mason University, dcarr@galaxy.gmu.edu
Terry Gaasterland, Rockefeller Institute, gaaster@rockvax.rockefeller.edu
Brian Wylie, VisWave, wylie@viswave.com
Maggie Werner-Washburne, VisWave & University of New Mexico
Naren Ramakrishnan, Virginia Tech, naren@cs.vt.edu
Cliff Shaffer, Virginia Tech, shaffer@shaffer.cs.vt.edu
Marc Vass, Virginia Tech, mvass@vt.edu
Yidong Chen, NHGRI

Attendees

Stephen M. Mount, University of Maryland, smount@wam.umd.edu

Abstracts

TimeSearcher: Interactive Querying for Identification of Patterns in Genetic Data

Harry Hochheiser and Ben Shneiderman
University of Maryland, Department of Computer Science
PowerPoint slides

Microarray experiments are often used to examine changes in gene expression over time. Generally, these data sets are analyzed using clusters, self-organizing maps, heat maps, and other standard microarray analysis tools. TimeSearcher is a general-purpose tool for exploration and pattern identification in time series data. TimeSearcher is based on the use of timeboxes - rectangular, direct-manipulation queries - to support interactive exploration via dynamic queries (100 ms response time). TimeSearcher also provides overviews of query results and drag-and-drop support for query-by-example. The use of TimeSearcher for analysis of microarray time series will be discussed, along with other potential applications of TimeSearcher to bioinformatics problems.
http://www.cs.umd.edu/hcil/timesearcher
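
The timebox idea in the abstract above can be sketched in a few lines of Python. This is only an illustration, not TimeSearcher's code: it assumes a profile matches a timebox when every sample inside the box's time range falls inside its value range, and that several timeboxes combine as a conjunction. The `Timebox` class, the `query` function, and the gene profiles are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Timebox:
    """A rectangular query: time steps [t_start, t_end], values [v_min, v_max]."""
    t_start: int
    t_end: int
    v_min: float
    v_max: float

    def matches(self, series: Sequence[float]) -> bool:
        # Assumed semantics: every sample inside the time window must lie
        # inside the value window.
        window = series[self.t_start:self.t_end + 1]
        return all(self.v_min <= v <= self.v_max for v in window)


def query(profiles: Dict[str, Sequence[float]], boxes: List[Timebox]) -> List[str]:
    """Return the names of profiles that satisfy every timebox (conjunction)."""
    return sorted(name for name, series in profiles.items()
                  if all(box.matches(series) for box in boxes))


# Hypothetical expression profiles (log ratios at five time points).
profiles = {
    "geneA": [0.1, 0.9, 1.2, 0.4, 0.0],
    "geneB": [0.0, 0.2, 0.3, 0.2, 0.1],
    "geneC": [0.2, 1.0, 1.1, 1.3, 0.9],
}

# "High between steps 1 and 2" and "low at step 4".
boxes = [Timebox(1, 2, 0.8, 1.5), Timebox(4, 4, -0.5, 0.5)]
print(query(profiles, boxes))
```

In an interactive tool, dragging or resizing a box would simply rerun `query` with the updated bounds; the filter is a single pass over the data, which is what makes a fast response plausible for moderately sized data sets.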

 

Hierarchical Clustering Explorer - Understanding Hierarchical Clustering Results by Interactive Exploration of Dendrograms: A Case Study with Genomic Microarray

Jinwook Seo and Ben Shneiderman
University of Maryland, Department of Computer Science
PowerPoint slides

Hierarchical clustering is widely used to find patterns in multi-dimensional datasets, especially for genomic microarray data. Finding groups of genes with similar expression patterns can lead to better understanding of the functions of genes. Current visualization tools for hierarchical clustering that provide static outputs on screens or even large printouts can be improved by adding interactive exploration tools. HCE (Hierarchical Clustering Explorer) is a visualization tool that integrates four general techniques that could be used in interactive explorations of hierarchical clustering results: (1) overview of the entire dataset, coupled with a detail view so that high-level patterns and hot spots can be easily found and examined, (2) dynamic query controls so that users can restrict the number of clusters they view at a time and show those clusters more clearly, (3) coordinated displays: the overview mosaic has a bi-directional link to 2-dimensional scattergrams, (4) cluster comparisons to allow researchers to see how different clustering algorithms group the genes. In this talk, I'll discuss how HCE can be used for the clustering results of microarray data, together with some other issues in multi-dimensional data visualization.
http://www.cs.umd.edu/hcil/multi-cluster
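
One way to picture the dynamic query control in item (2) is as a threshold that cuts the dendrogram, so that moving the threshold changes how many clusters are shown. The Python sketch below is an assumption-laden illustration, not HCE's implementation: it uses single-linkage clustering with Euclidean distance, and the `cut_clusters` function and the sample profiles are hypothetical.

```python
from itertools import combinations
from math import dist
from typing import Dict, List, Sequence


def cut_clusters(profiles: Dict[str, Sequence[float]],
                 max_distance: float) -> List[List[str]]:
    """Single-linkage agglomerative clustering, stopped at max_distance.

    Equivalent to cutting the single-linkage dendrogram at height max_distance.
    """
    clusters = [[name] for name in sorted(profiles)]

    def linkage(a: List[str], b: List[str]) -> float:
        # Single linkage: distance between the closest pair of members.
        return min(dist(profiles[x], profiles[y]) for x in a for y in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        if linkage(clusters[i], clusters[j]) > max_distance:
            break  # every remaining merge lies above the cut
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(clusters)


# Hypothetical two-condition expression profiles.
genes = {"g1": [0.0, 0.0], "g2": [0.0, 1.0], "g3": [5.0, 5.0], "g4": [5.0, 6.0]}

for threshold in (0.5, 1.5, 10.0):
    print(threshold, cut_clusters(genes, threshold))
```

Raising the threshold merges more genes into fewer, larger clusters; lowering it splits them apart, which is the kind of control a slider could expose.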
 

Visualization for cancer classification by gene expression profiling

Yidong Chen
NHGRI
PDF slides

With the advance of microarray technologies, biologists can now observe the abundance of transcripts from tens of thousands of genes in biological samples, enabling the exploration of the dynamics of transcription and of interactions between genes on a genome-wide scale. With the accumulation of gene expression datasets, the challenge in all microarray experiments is how to extract meaningful and trustworthy information from among thousands of genes that do not contribute to the designed experiments. To achieve this goal, many rigorous mathematical tools and software packages have been introduced to the field, such as statistical techniques for data normalization, clustering algorithms, class prediction methods, ANOVA, and gene-gene interaction studies. Because many gene expression experiments collect a relatively small number of samples from patients, cell lines, or other biological sources, which renders some popular statistical tools meaningless, the development of data visualization techniques is crucial in the earlier stages of microarray experiment design. To help biologists efficiently organize, and therefore understand, the properties of their datasets, we introduced and implemented the multidimensional scaling (MDS) technique to provide direct appreciation of the clustering outcome, various clustering techniques for data organization and pattern finding, techniques for visualizing gene-gene interaction via the coefficient of determination (CoD), and many others. In this presentation, we will use one of the gene expression profile studies of melanoma samples in our lab to illustrate, step by step, the visualization tasks required in the lab and the many tools available at the NHGRI microarray data analysis web site.

Discovering Functional Similarity of Genes by Mining in Visualizations of Gene Profiles

Tanveer Syeda-Mahmood
IBM Almaden Research Center

Traditionally, visualization techniques have been used to illustrate the results of mining. Visualization scientists, on the other hand, have recognized that often the visualization itself can be a good source of mining for further information. Automatic tools to mine such visualized representations, however, are lacking. In this talk, I will present a method for simultaneously discovering similarities between multiple time-varying profiles that operates directly on the combined multi-dimensional visualization of such profiles. Specifically, scale-space analysis is used to identify salient curvature changes in multi-dimensional curves forming the basis of similarity between time profiles. An application of this technique for discovering functional similarities in genes will be discussed.
 

Interactive Graphical Display of Protein Structures

Amitabh Varshney
University of Maryland, Department of Computer Science

Recent successes in human genome sequencing have taken us a step closer to the goal of designing novel therapeutic drugs. We are developing visual computing tools and technologies that will give scientists deeper insight into the relationships between form and function in various biological proteins. The smooth, solvent-accessible molecular surface is useful for studying the structure and interactions of proteins, especially for testing the accessibility of a solvent in a molecule; for prediction of three-dimensional structures of biological macromolecules and assemblies; and for evaluating different docking conformations of molecules which can be used in drug design. I shall discuss a fast and efficient parallel algorithm for interactive computation of solvent-accessible smooth molecular surfaces. I shall also discuss some of our recent approaches to studying surface complementarity and efficient algorithms for computing and visualizing molecular electrostatics.
 

Using Self-Similar Geometry to Represent Letter-Sequence-Indexed Statistics
With Application to Nucleotide and Peptide Docking

Daniel B. Carr 
George Mason University

The paper addresses the challenge of representing statistics indexed by sequences of letters. Letters of a sequence represent nucleotides or amino acids in the motivating applications. The number of letter combinations grows exponentially with sequence length. The challenge is to develop representations for the space of possibilities that are cognitively accessible and that convey scientific relationships. The approach described in this paper develops coordinate systems based on simple geometric structures: tetrahedrons in the case of 4 nucleotides and icosahedron face centers in the case of 20 amino acids. The paper demonstrates two self-similar coordinate generating mechanisms that help to provide cognitive accessibility: self-similarity at the same scale and at different scales. The coordinate systems directly represent short sequences of, say, 6 nucleotides or 3 amino acids and extend to longer sequences by connecting points. Layout variations modify the representations to produce a simpler appearance and to concentrate sequences with similar statistics. New visualization software also handles the representation of features in two-, three- and four-dimensional margin tables and provides dynamic options such as filtering.
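
To make "self-similarity at different scales" concrete, here is one possible construction in Python. It is an assumption for illustration, not the coordinate system from the paper: each nucleotide is assigned a vertex of a regular tetrahedron, and the k-th letter of a sequence contributes its vertex scaled by 1/2^k. The vertex assignment, the scaling ratio, and the `sequence_to_point` function are all hypothetical.

```python
from itertools import product
from typing import Tuple

Point = Tuple[float, float, float]

# Vertices of a regular tetrahedron centered at the origin, one per nucleotide.
# The letter-to-vertex assignment is arbitrary and chosen for illustration.
VERTEX = {
    "A": (1.0, 1.0, 1.0),
    "C": (1.0, -1.0, -1.0),
    "G": (-1.0, 1.0, -1.0),
    "T": (-1.0, -1.0, 1.0),
}


def sequence_to_point(seq: str, ratio: float = 0.5) -> Point:
    """Map a nucleotide sequence to a 3-D point.

    The k-th letter (k = 1, 2, ...) adds its vertex scaled by ratio**k, so the
    first letter picks a sub-tetrahedron, the second a smaller tetrahedron
    inside it, and so on: the same shape repeated at smaller scales.
    """
    x = y = z = 0.0
    scale = 1.0
    for letter in seq:
        scale *= ratio
        vx, vy, vz = VERTEX[letter]
        x += scale * vx
        y += scale * vy
        z += scale * vz
    return (x, y, z)


# All 64 sequences of length 3 receive their own coordinates.
points = {"".join(s): sequence_to_point("".join(s))
          for s in product("ACGT", repeat=3)}
print(len(points), len(set(points.values())))
print(sequence_to_point("AC"))
```

With this layout, sequences that share a prefix land near each other, so a statistic drawn at each point (for example as a colored glyph) can reveal patterns among related sequences.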
 

Numbers, Images and Geometry: Using Visualization to Explore Patterns Across Multiple Data Types in the Life Sciences

Bernice Rogowitz
IBM

Genome and Literature: Combining Two Massive Data Sets through Ontologies

Peter Li
Celera

A challenge facing bioinformaticians in the era of post-genome research is the integration of genome data with other domains. One such domain is the literature, which is massive and just as complex. Medline provides easy access to most of the published literature of interest to biomedical researchers. Although only abstracts are available, Medline can nevertheless serve as a representative literature source for integration with genome data. A basic integration approach is to find common names and sequences "quoted" by both sources. A more semantic approach would take advantage of the active development of ontologies in both data sets, e.g. Gene Ontology for genes and MESH/UMLS for Medline. We will explore both approaches and the user interface challenges they present.
 

The Celera Genome Browser: A Tool for Visualizing and Annotating the Human Genome

Russell Turner
Celera

We present the Genome Browser, an interactive graphical tool for visualizing and curating the nucleotide sequences of large genomes, in particular, the human genome. This tool, developed by Celera Genomics and used by Celera scientists and customers, permits raw nucleotide information to be visualized, together with accompanying annotation information. It also provides interactive capabilities for human curation of genes. The software is written completely in Java and has a three-tiered architecture with a high-performance "thick" graphical client, an EJB-based middle-tier server, and an Oracle database backend. This architecture allows a terabyte-sized genomic database containing annotations on sequences exceeding 3 billion base pairs in length to be viewed using a direct manipulation graphical user interface displaying tens of thousands of zoomable data points at a time. It also allows layering of additional user-specified data on top of the database data via an XML import capability. Curation operations are performed by the user using an interactive "drag-and-drop" style to create and modify gene and transcript information. Curation information is exported via XML files which can then be loaded into the database using a separate curation "promotion" utility. This combined XML and three-tiered data architecture provides sufficient flexibility to support a variety of different genomic data formats and curation workflows.
 

The Comprehensive Microbial Resource

Owen White, Lowell Umayam, Tanja Dickinson, Jeremy Peterson
TIGR
PowerPoint slides

One of the challenges presented by large-scale genome sequencing efforts is the effective display of information in a format that is accessible to the laboratory scientist. The Comprehensive Microbial Resource (CMR) contains all of the fully sequenced microbial genomes, the curation from the original sequencing centers, and further curation from TIGR (for those genomes sequenced outside TIGR). The interface to this database effectively "slices" the vast amounts of data in the sequencing databases in a wide variety of ways, allowing the user to formulate queries that search for specific genes as well as to investigate broader topics, such as genes that might serve as vaccine and drug targets. The web presentation of the CMR includes the comprehensive collection of bacterial genome sequences, curated information, and related informatics methodologies. The scientist can view genes within a genome and can also link across to related genes in other genomes. As a result, the scientist can construct queries that combine sequence searches, biological role, taxonomy, function, environment, and other criteria to isolate the genes of interest. The database contains extensive curated data as well as pre-run homology searches to facilitate data mining. The interface allows the display of the results in numerous formats that will help the user ask more accurate questions. The methodology for populating the database, the user interface, and new methods for automated functional assignment will be presented.
 

Comparative Visualization of Genome-Scale Datasets

Brian Wylie
Maggie Werner-Washburne
VisWave
University of New Mexico, Department of Biology
PowerPoint slides

Genome-scale data presents incredible analytical challenges to biologists. Here we report the comparative, visual analysis of yeast gene-expression (cell cycle and exit from stationary phase/G0) and several protein-interaction datasets using VxInsight, a clustering and visualization tool to develop hypotheses, speed data mining and, thus, enhance the discovery process. Differences in gene clusters between the gene-expression datasets for the two related biological processes led to new, testable hypotheses. For example, lack of clustering of G1-regulated (cell-cycle) genes in the exit from stationary phase dataset suggests that either the cells exiting stationary phase are not synchronous or that a subset of G1-regulated genes is required for this process. Additionally, the relative lack of interactions between ribosomal proteins in both 2-hybrid datasets, which is easily observed as a function of gene expression, suggests that 2-hybrid methods may not be able to detect ribosomal protein interactions, possibly because the bait and prey proteins are incorporated into ribosomes in the nucleus. Biologists tend to be visually oriented. Thus, providing a tool that allows large datasets to be "learned" and queried visually enhances hypothesis development and, eventually, the design of these large experiments, as biologists learn to use visual analysis in designing genome-scale experiments to ask more specific and novel questions.
 

Supporting Collaborative Bio-Informatics Discovery with Visualization and Analytics

Christopher Ahlberg
Spotfire

Over the last 10 years, pharmaceutical discovery has seen an explosion in data generated from high-throughput technologies as well as from procurement of high-value content - across the whole pharmaceutical discovery value chain. In addition to the underlying data explosion, pharma discovery is also facing a decision explosion, where new ways of organizing research and development drive novel decision-making approaches.

This presentation will draw from the speaker's experience in deploying visualization and analytics in large pharma and biotech over the last five years - and show successes and challenges - what works and what doesn't work. Further, key insights into how to make novel visualizations and algorithms matter beyond small groups of high-end researchers will be presented - trying to show how the power of high-end individuals can be spread to large user communities.
 

Building Biological Explanations for Gene Expression Patterns

Terry Gaasterland
Rockefeller Institute
 

Biological Storytelling: A Software Tool for Biological Information Organization Based upon Narrative Structure 

Allan Kuchinsky, Kathy Graham, David Moh, Michael L. Creech, Ketan Babaria, and Annette Adler
Agilent Corporation
PowerPoint slides

The work of molecular biologists seeking to understand the molecular basis of disease centers on identifying and interpreting the relationships of genes, proteins, and pathways in living organisms. While emerging technologies have provided powerful analysis tools to this end, they have also produced an explosion of data, which biologists need to make sense of. We have built software tools to support the synthesis activities of molecular biologists, in particular the activities of organizing, retrieving, using, sharing, and reusing diverse biological information. A key aspect of our approach, based upon the findings of user studies, is the use of narrative structure as a conceptual framework for developing and representing the "story" of how genes, proteins, and other molecules interact in biological processes. Biological stories are represented both textually and graphically within a simple conceptual model of items, collections, and stories. Using our software, biologists can build up high-level graphical and narrative models of biological processes in living cells, interactively explore those models, and evaluate these models against detailed experimental data, using visual data overlays.
 

Modeling Intra-Cellular Regulatory Networks with Applications in Model Definition and Evaluation

Naren Ramakrishnan, Cliff Shaffer, and Marc Vass
Virginia Tech
PowerPoint slides

The JigCell Problem Solving Environment (PSE) provides experimentalists and modelers with a set of tools for modeling intra-cellular regulatory networks. Users define models in terms of chemical reactions entered into a Model Builder. Our approach simplifies model building through a spreadsheet metaphor that reduces visual clutter and segments the model into chunks that naturally fit the typical user's mental image. Specifications for simulating the model with specific parameters and initial conditions are made in a Run Builder. The Run Builder then takes the set of chemical equations and the various parameter settings to generate a set of ordinary differential equations. Several tools may then be used to explore the output produced by solving these ODEs. The Comparator quantitatively evaluates collections of experimental measurements and simulation results to assist the user in validating the model. Numerical and graphical visualizations are provided with support for external visualization packages. JigCell is currently being tested with frog egg extract models and budding yeast cell models from John Tyson's Computational Cell Biology Lab at Virginia Tech.
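
The step from chemical reactions to ordinary differential equations described above can be illustrated with a short Python sketch. It is not JigCell code: it assumes mass-action kinetics and uses a simple forward Euler integrator, and the `Reaction` class, the `derivatives` and `simulate` functions, and the example model are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Reaction:
    reactants: Dict[str, int]   # species -> stoichiometric coefficient
    products: Dict[str, int]
    rate_constant: float


def derivatives(reactions: List[Reaction],
                conc: Dict[str, float]) -> Dict[str, float]:
    """Right-hand side of the ODE system under assumed mass-action kinetics."""
    dcdt = {species: 0.0 for species in conc}
    for rxn in reactions:
        # Reaction rate = k * product of reactant concentrations^coefficient.
        rate = rxn.rate_constant
        for species, n in rxn.reactants.items():
            rate *= conc[species] ** n
        for species, n in rxn.reactants.items():
            dcdt[species] -= n * rate
        for species, n in rxn.products.items():
            dcdt[species] += n * rate
    return dcdt


def simulate(reactions: List[Reaction], initial: Dict[str, float],
             dt: float = 0.001, steps: int = 5000) -> Dict[str, float]:
    """Integrate with forward Euler; a production tool would use a better solver."""
    conc = dict(initial)
    for _ in range(steps):
        dcdt = derivatives(reactions, conc)
        for species in conc:
            conc[species] += dt * dcdt[species]
    return conc


# Hypothetical model: A + B -> C (k = 2.0) and C -> A + B (k = 1.0).
model = [
    Reaction({"A": 1, "B": 1}, {"C": 1}, 2.0),
    Reaction({"C": 1}, {"A": 1, "B": 1}, 1.0),
]
final = simulate(model, {"A": 1.0, "B": 1.0, "C": 0.0})
print({name: round(value, 3) for name, value in final.items()})
```

For this toy model the concentrations settle where the forward and reverse rates balance (2·A·B = C), which gives a simple check that the generated equations behave as expected. Comparing such simulated values against measurements is the kind of task the abstract assigns to the Comparator.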
 

Last Modified 23 April 2003, 11:21 AM hsh@cs.umd.edu