PhD Proposal: Finding and Wielding Sparse Functional Motifs in Biopolymer Sequences
High-throughput sequencing technologies have produced a massive proliferation of raw biopolymer sequence data of unknown structure and function. As a result, a large body of research has investigated the relationship between sequence, structure, and function, and there is great interest in methods for predicting structure and function from sequence data alone. Central to much of this research is the notion of evolutionary conservation: if a sequence region is functionally relevant, it will tend to be preserved by natural selection and therefore remain recognizably similar across homologous sequences.
Homologous regions are referred to in different contexts as conserved regions, functional motifs, and (local) alignments. We are primarily interested in functional motifs, which typically describe heavily conserved regions with specific and essential functions, such as transcription factor binding sites and protein-protein interaction domains. In general, functional motifs are much less common than local alignments, which may be valid and of biological interest despite occurring in only two sequences in a massive database. Motif-finding algorithms therefore face a different search problem from algorithms focused on pairwise homologies, and they typically employ different search methods.
Our proposed research can be divided into two parts. First, we are interested in the problem of identifying undiscovered motifs, where we will focus on a class of sparse motifs that are likely to be overlooked by existing tools. We characterize the fundamental combinatorial challenges associated with searching for these motifs and discuss a previous finding suggesting that such regions are biologically relevant. Additionally, we will investigate the robustness of common local alignment search tools, which typically rely on short contiguous matches, and the potential for sparse motifs to improve their sensitivity.
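To make the contrast between contiguous and sparse matching concrete, the following is a minimal sketch and not the method proposed here. It assumes, purely for illustration, that a sparse pattern can be modeled as a spaced-seed mask in which `1` marks a position that must match and `0` marks a position that is ignored. The function name `seed_hits`, the mask notation, and the two toy sequences are all hypothetical.

```python
def seed_hits(a: str, b: str, mask: str) -> list[tuple[int, int]]:
    """Return (i, j) pairs where a[i:] and b[j:] agree at every '1' position of mask.

    Illustrative assumption: mask "111" is a contiguous 3-mer seed, while
    mask "1101" is a spaced seed with the same number of required matches.
    """
    span = len(mask)
    care = [k for k, c in enumerate(mask) if c == "1"]  # positions that must match

    # Index every window of b by the characters at the required positions.
    index: dict[str, list[int]] = {}
    for j in range(len(b) - span + 1):
        key = "".join(b[j + k] for k in care)
        index.setdefault(key, []).append(j)

    # Look up every window of a in that index.
    hits = []
    for i in range(len(a) - span + 1):
        key = "".join(a[i + k] for k in care)
        for j in index.get(key, []):
            hits.append((i, j))
    return hits


# Toy sequences (hypothetical) that differ at every third position.
seq_a = "ACGTTGCA"
seq_b = "ACATTCCA"

contiguous = seed_hits(seq_a, seq_b, "111")   # no shared contiguous 3-mer
spaced = seed_hits(seq_a, seq_b, "1101")      # matches that skip the variable position
```

In this toy case the two sequences share no contiguous 3-mer, so a contiguous seed of weight three finds nothing, while a spaced seed with the same number of required positions recovers two hits. Real search tools use longer seeds and additional filtering, so this only illustrates the sensitivity argument in miniature.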
Second, we will refine a previously developed method designed to improve performance on motif scaffolding, a protein design task in which a complete, viable protein must be designed around a core functional motif. We identify a limitation of ESM3, a prominent generative protein design model capable of performing motif scaffolding, and suggest a potential solution. We present previous findings and the challenges encountered during development. We will then isolate the variables that explain those findings and proceed with further development of our methods. Finally, we will explore extensions to multimodal protein modeling that incorporate predicted structure data.