CMSC 838T, Project 1

CMSC 838T Project 2

Basic Information

The goal of this project is to find a bioinformatics research topic related to high-performance computing and obtain some preliminary results.

Your project should be motivated by the question of how to use increasing computational power to improve the quality of bioinformatics applications, while accounting for the near-exponential growth in the size of sequence databases. Earlier algorithms may have been oversimplified because processing power was limited. Look for opportunities to spend computing power in order to save user effort.

You may work in 2-person groups (I will coordinate multiple groups on larger projects).

Here is a first pass at some project suggestions.

Possible project topics

Parallel sequence alignment / search algorithms

- compare existing parallel tools (MPI-BLAST, SMP BLAST, etc.)
- develop & compare OpenMP / UPC / SHMEM versions of BLAST
- consider ways of improving parallel performance / throughput

Experimental evaluation of bioinformatic algorithms
- sequence search / alignment algorithms
- multiple sequence alignment
- gene prediction techniques
- EST clustering algorithms
- noncoding RNA prediction

Preprocessing sequence databases
- to improve sequence search
- to extract useful information
- to compress information

Any other good ideas you can think of...

More detailed project descriptions

1) Improving performance of sequence search algorithms

Source code for parallel versions of BLAST is available. Evaluate and improve parallel BLAST performance by

measuring the performance of the MPI-BLAST source code
rewriting MPI-BLAST to use other parallel programming models (UPC / SHMEM / OpenMP)
comparing throughput vs. parallel speedup (developing on-line scheduling algorithms that depend on workload)
examining cache / memory access characteristics of bioinformatics software for possible improvements
evaluating bioinformatics software to determine whether bottlenecks are in CPU / cache / memory / I/O
determining whether hardware accelerators are still cost-effective vs. the latest microprocessors (based on published results)
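
The throughput vs. speedup trade-off above can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the function names are hypothetical, the timings passed in are ones you would measure yourself, and concurrent queries are assumed not to interfere with each other (no shared memory or I/O contention), which real BLAST runs will likely violate.

```python
import math


def speedup_and_efficiency(timings):
    """timings maps processor count -> wall-clock seconds for ONE query.

    Must contain an entry for 1 processor. Returns
    {procs: (speedup, efficiency)}.
    """
    base = timings[1]
    return {p: (base / t, base / (t * p)) for p, t in timings.items()}


def batch_time(num_queries, total_procs, procs_per_query, time_per_query):
    """Time to finish a batch when each query gets procs_per_query CPUs.

    Assumption: queries running side by side do not slow each other down.
    """
    concurrent = total_procs // procs_per_query
    rounds = math.ceil(num_queries / concurrent)
    return rounds * time_per_query


def best_partition(num_queries, total_procs, timings):
    """Pick the processors-per-query setting that finishes the batch first."""
    candidates = {
        g: batch_time(num_queries, total_procs, g, t)
        for g, t in timings.items()
        if g <= total_procs
    }
    best = min(candidates, key=candidates.get)
    return best, candidates[best]
```

With sublinear speedup, a large batch usually finishes sooner when many queries run side by side on few processors each, while a single urgent query benefits from using all processors. An on-line scheduler could re-evaluate this choice as the queue length changes.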

Links

SGI parallel bioinformatic evaluation here...
Parallel Smith-Waterman evaluation here...
MPI-BLAST benchmark results here...
DeCypher benchmark results here...

2) Evaluating sensitivity / specificity of sequence search algorithms

A number of papers have compared the BLAST / FASTA / Smith-Waterman algorithms for discovering distant members of protein families. Repeat these comparisons using

new protein families
new search algorithms (PHI-BLAST, MegaBlast, PatternHunter, GeneWise, etc.)
measurements of time/memory usage vs. database size
non-coding RNA sequences
your own ideas on improving search results
additional bioinformatic algorithms (clustering, gene prediction, etc.)
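
As a starting point for such comparisons, sensitivity and specificity can be computed from the set of sequences a search reports and the set of known family members. The sketch below is a minimal illustration; the function name and the representation of results as sets of sequence IDs are assumptions, and the published evaluations may use different measures (for example, coverage at a fixed error rate).

```python
def sensitivity_specificity(hits, true_members, database_ids):
    """Score one search against a known protein family.

    hits          -- IDs reported by the search tool
    true_members  -- IDs known to belong to the family
    database_ids  -- all IDs in the searched database
    """
    database_ids = set(database_ids)
    hits = set(hits) & database_ids
    true_members = set(true_members) & database_ids

    tp = len(hits & true_members)                 # family members found
    fp = len(hits - true_members)                 # unrelated sequences reported
    fn = len(true_members - hits)                 # family members missed
    tn = len(database_ids - hits - true_members)  # unrelated, not reported

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity
```

Running this for each tool and each score threshold gives the points needed to plot sensitivity against specificity and compare tools on the same family.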

Links

Assessing Sequence Comparison Methods (PDF)...
Sensitivity and Selectivity in Protein Similarity Searches... (PDF)...
PatternHunter... (PDF)...
nonCoding RNAs (PDF)...

3) Preprocessing sequence databases

A number of researchers have suggested compressed sequence database formats. Investigate the following issues:

database performance while compressed
preprocessing the sequence database for BLAST searches
preprocessing the sequence database for useful information / motifs (most frequent 25-mers, etc.)
impact on genome-level comparisons / alignments
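
Two of these ideas are easy to prototype. The sketch below counts the most frequent k-mers in a sequence and packs DNA into 2 bits per base. The function names are hypothetical, the 2-bit packing is only a simple baseline (it is not the algorithm from the compression paper linked below), and it assumes the sequence contains only A, C, G, and T. The big-integer packing is for illustration and would be slow on genome-sized input.

```python
from collections import Counter

_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASES = "ACGT"


def most_frequent_kmers(sequence, k, top=5):
    """Return the `top` most common substrings of length k with counts."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return counts.most_common(top)


def pack_2bit(sequence):
    """Pack an A/C/G/T string into bytes, 4 bases per byte.

    Returns (length, packed_bytes); the length is needed to unpack.
    Raises KeyError on any other character (e.g. N).
    """
    value = 0
    for base in sequence.upper():
        value = (value << 2) | _CODE[base]
    nbytes = (2 * len(sequence) + 7) // 8
    return len(sequence), value.to_bytes(nbytes, "big")


def unpack_2bit(length, packed):
    """Inverse of pack_2bit."""
    value = int.from_bytes(packed, "big")
    return "".join(
        _BASES[(value >> (2 * (length - 1 - i))) & 3] for i in range(length)
    )
```

For the project, `k=25` on a real database would need a more memory-efficient counter than an in-memory dictionary, which is itself a reasonable thing to measure.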

Links

A Compression Algorithm for DNA Sequences... (PDF)

Approach

Conduct a survey of related work (read related research papers)
Write up a short description of the proposed research before proceeding
Initially, concentrate on setting up tools / procedures
Later, focus on collecting experimental data
Present preliminary results on the last day of class
Turn in a short research paper describing the project