CMSC 838T Project 2
The goal of this project is to find a bioinformatics research topic related to
high-performance computing and obtain some preliminary results.
The motivating factor for your project should be to figure out how to take
advantage of increasing computation power to improve the quality of
bioinformatic applications, while taking into account the near-exponential
growth in the size of sequence databases. Earlier algorithms may be
oversimplified due to concerns about processing power. Try to find opportunities
to use computing power to save user effort.
You may work in 2-person groups (I will coordinate multiple groups on larger projects).
Here's a first pass at some project suggestions.
Possible project topics
- Parallel sequence alignment / search algorithms
  - compare existing parallel tools (MPI-blast, SMP blast, etc)
  - develop & compare OpenMP / UPC / SHMEM versions of BLAST
  - consider ways of improving parallel performance / throughput
- Experimental evaluation of bioinformatic algorithms
  - sequence search / alignment algorithms
  - multiple sequence alignment
  - gene prediction techniques
  - EST clustering algorithms
  - noncoding RNA prediction
- Preprocessing sequence databases
  - to improve sequence search
  - to extract useful performance
  - to compress information
- Any other good ideas you can think of...
More detailed project descriptions
1) Improving performance of sequence search algorithms
Source code for parallel versions of BLAST is available.
Evaluate parallel BLAST performance. Possible directions:
- start from the source code for MPI-BLAST
- rewrite MPI-BLAST to use other parallel programming models (UPC / SHMEM / OpenMP)
- compare throughput vs parallel speedup (develop on-line scheduling
algorithms depending on workload)
- examine cache / memory access characteristics of bioinformatics software
for possible improvements
- evaluate bioinformatics software to determine whether bottlenecks are in
CPU / cache / memory / I/O
- determine whether hardware accelerators are still cost-effective vs.
latest microprocessors (based on published results)
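When comparing throughput against parallel speedup, it may help to fix the definitions up front. The following is a minimal sketch, not part of any BLAST distribution: the function names are hypothetical, and the timings passed in are assumed to be wall-clock seconds you measured yourself.

```python
def speedup_and_efficiency(serial_seconds, parallel_seconds, workers):
    """Return (speedup, efficiency) for a single query run on `workers` processors.

    speedup    = serial time / parallel time
    efficiency = speedup / number of workers (1.0 means perfect scaling)
    """
    if serial_seconds <= 0 or parallel_seconds <= 0 or workers <= 0:
        raise ValueError("times and worker count must be positive")
    speedup = serial_seconds / parallel_seconds
    return speedup, speedup / workers


def throughput(queries_completed, wall_seconds):
    """Return queries finished per second over a whole batch of queries."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return queries_completed / wall_seconds
```

Speedup describes how fast one query finishes, while throughput describes how many queries finish per unit time. Running many independent serial searches side by side can give high throughput with no per-query speedup, which is why the two are worth reporting separately.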
Links
- SGI parallel bioinformatic evaluation here...
- Parallel Smith-Waterman evaluation here...
- MPI-BLAST benchmark results here...
- DeCypher benchmark results here...
2) Evaluating sensitivity / specificity of sequence search algorithms
A number of papers have compared the BLAST / FASTA / Smith-Waterman
algorithms for discovering distant members of protein families.
Repeat these comparisons using
- new protein families
- new search algorithms (PHI-BLAST, MegaBlast, PatternHunter, GeneWise,
etc.)
- measurements of time/memory usage vs. database size
- non-coding RNA sequences
- your own ideas on improving search results
- evaluate additional bioinformatic algorithms (clustering, gene prediction,
etc.)
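As a starting point for such an evaluation, sensitivity and specificity can be computed from the set of sequences a search reports and the set of known family members. This is a minimal sketch under stated assumptions: the function name and the use of sequence identifiers as set elements are illustrative choices, and a real study would also need to decide on a score or E-value cutoff before building the `reported` set.

```python
def sensitivity_specificity(reported, true_members, database_ids):
    """Return (sensitivity, specificity) for one search against one family.

    reported      : ids the search tool returned as hits
    true_members  : ids known to belong to the family
    database_ids  : every id in the searched database
    """
    database = set(database_ids)
    reported = set(reported) & database        # ignore ids outside the database
    true_members = set(true_members) & database

    tp = len(reported & true_members)              # family members found
    fn = len(true_members - reported)              # family members missed
    fp = len(reported - true_members)              # non-members reported
    tn = len(database - reported - true_members)   # non-members correctly ignored

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

Because a family is usually a tiny fraction of the database, specificity computed this way tends to sit very close to 1, so it may be more informative to also report the raw false-positive count.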
Links
- Assessing Sequence Comparison Methods
(PDF)...
- Sensitivity and Selectivity in Protein Similarity Searches...
(PDF)...
- PatternHunter...
(PDF)...
- nonCoding RNAs
(PDF)...
3) Preprocessing sequence databases
A number of researchers have suggested compressed sequence
database formats. Investigate issues such as:
- evaluate database performance (while compressed)
- preprocess sequence database for BLAST searches
- preprocess sequence database for useful information / motifs (most
frequent 25-mers, etc.)
- impact on genome-level comparisons / alignments
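For the motif item above, a simple way to find the most frequent 25-mers is to slide a window over every sequence and count what it sees. The sketch below assumes DNA sequences held in memory as Python strings; the function name and parameters are hypothetical, and windows containing ambiguity codes such as N are skipped.

```python
from collections import Counter

DNA_ALPHABET = set("ACGT")


def most_frequent_kmers(sequences, k=25, top=10):
    """Return the `top` most common k-mers as (kmer, count) pairs."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k across the sequence, one base at a time.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows with ambiguous or non-DNA characters.
            if set(kmer) <= DNA_ALPHABET:
                counts[kmer] += 1
    return counts.most_common(top)
```

An in-memory counter like this is only practical for small inputs. For a full sequence database the number of distinct 25-mers can exceed available memory, so a project would likely need a disk-based or partitioned counting scheme.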
Links
- A Compression Algorithm for DNA Sequences...
(PDF)
Approach
- Conduct survey of related work (read related research papers)
- Write up a short description of proposed research before proceeding
- Initially concentrate on setting up tools / procedures
- Later focus on collecting experimental information
- Present preliminary results on last day of class
- Turn in short research paper describing project