Presented by: Joshua Baller
The yeast Saccharomyces cerevisiae is host to a number of transposons, small sequences of DNA that replicate and move within the chromosomes of their host. Two such transposons, Ty1 and Ty5 (Tys), are members of the retrotransposon class. A signature of retrotransposons is the presence of reverse transcriptase and integrase genes in their coding sequence. Reverse transcriptase generates a cDNA copy of the transposon transcript, while integrase inserts the cDNA copy into the genome. Previous studies have shown that Ty5 integrase interacts with the heterochromatin protein Sir4, creating a preference for insertion into Sir4-covered regions of the genome. Likewise, Ty1 is hypothesized to interact with a component of the Pol III transcription complex, resulting in insertion near Pol III-transcribed genes. In both cases, the observed insertions are not distributed uniformly across the chromatin thought to carry the interacting factor. This suggests the existence of secondary factors influencing insertion site preference. To identify these factors, we have applied machine learning approaches. In the case of Ty5, we applied log-linear classifiers and identified telomeres, Y' elements, and the area surrounding ARS consensus sequences as transposition hotspots. Additionally, we identified nucleosomes and ORFs as areas of decreased transposition. To validate our classification accuracy, we used ROC analysis under 5-fold cross-validation. For Ty1, we applied regression models to relate various features to insertion frequency at Pol III-transcribed genes. This work has confirmed a weak correlation between insertion frequency and the Pol III machinery. The features used by the classifier to discriminate between the two sets provide candidates for further benchtop research. Future work will improve feature selection and regularization in order to narrow the identified features down to a few key ones.
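The validation step described above, ROC analysis under 5-fold cross-validation of a log-linear classifier, can be illustrated with a minimal sketch. It is not the pipeline used in this work: it assumes the binary logistic form of a log-linear model, fits it by plain gradient descent, and runs on synthetic data. The feature names (near_telomere, nucleosome_occupancy, in_orf), the helper make_synthetic_windows, and all parameter values are hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=200, l2=0.01):
    """Fit an L2-regularized logistic (log-linear) classifier by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * (gwj / n + l2 * wj) for wj, gwj in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def predict_scores(model, X):
    # Predicted probability of the positive class for each row of X.
    w, b = model
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]


def roc_auc(y_true, scores):
    """Area under the ROC curve via the Mann-Whitney identity; ties count 0.5."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def cross_validated_auc(X, y, k=5, seed=0):
    """Return one held-out AUC per fold for k-fold cross-validation."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    aucs = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit_logistic([X[i] for i in train], [y[i] for i in train])
        scores = predict_scores(model, [X[i] for i in held_out])
        aucs.append(roc_auc([y[i] for i in held_out], scores))
    return aucs


def make_synthetic_windows(n=200, seed=1):
    """Hypothetical data: label 1 = window with an insertion, 0 = window without."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < 0.5 else 0
        near_telomere = rng.gauss(1.0 if label else 0.0, 1.0)          # assumed hotspot signal
        nucleosome_occupancy = rng.gauss(-1.0 if label else 0.0, 1.0)  # assumed depletion signal
        in_orf = rng.gauss(0.0, 1.0)                                   # uninformative here
        X.append([near_telomere, nucleosome_occupancy, in_orf])
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_synthetic_windows()
    aucs = cross_validated_auc(X, y, k=5)
    print("per-fold AUC:", [round(a, 3) for a in aucs])
    print("mean AUC:", round(sum(aucs) / len(aucs), 3))
```

Reporting the AUC on each held-out fold, rather than on the training data, is what guards against an over-optimistic accuracy estimate. The signs of the fitted weights indicate whether a feature is associated with increased or decreased insertion in the toy data.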