Understanding the relationship between genetic sequence and biological function or phenotype is a fundamental problem in biology. Knowledge of the sequence-function map is necessary for quantifying how a system evolves and for engineering novel molecules or organismal phenotypes. Recently developed experimental technologies have allowed the collection of large numbers of sequence-function pairs. However, exploiting this data for biophysical insight is challenging, as the data is a sparse sampling of the combinatorially large space of possible sequences.
I introduce statistical models to infer biophysical quantities that are not directly observable in this type of data. I use a large set of promoter sequences and expression measurements in E. coli to infer interactions between proteins that bind to the DNA. Using a different heuristic approach, I infer fold stability from thousands of mutated variants of green fluorescent protein, and show how enzymatic activity can be quantified in many variants of beta-lactamase. Finally, with a thermodynamic model, I infer an energy landscape of a small bacterial protein and separate the effects of mutations on binding and folding stability using data from a high-quality experiment with 500k variants.