Authors: N. Blöchliger; A. Caflisch; A. Vitalis

Journal: J. Chem. Theory Comput.
Year: 2015
Volume: 11
Issue: 11
Pages: 5481-5492
DOI: 10.1021/acs.jctc.5b00618
Type of Publication: Journal Article

Keywords: Data Analysis; Feature Selection; Feature Weighting; Progress Index; Protein Folding; Scalable Algorithm; Time Series Data


Data mining techniques depend strongly on how the data are represented and how distance between samples is measured. High-dimensional data often contain a large number of irrelevant dimensions (features) for a given query. These features act as noise and obfuscate relevant information. Unsupervised approaches to mine such data require distance measures that can account for feature relevance. Molecular dynamics simulations produce high-dimensional data sets describing molecules observed in time. Here, we propose to globally or locally weight simulation features based on effective rates. This emphasizes, in a data-driven manner, slow degrees of freedom that often report on the metastable states sampled by the molecular system. We couple this idea to several unsupervised learning protocols. Our approach unmasks slow side chain dynamics within the native state of a miniprotein and reveals additional metastable conformations of a protein. The approach can be combined with most algorithms for clustering or dimensionality reduction.
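
To make the idea of rate-based feature weighting concrete, here is a minimal sketch of global weighting followed by a weighted Euclidean distance. The abstract does not define how effective rates are estimated, so the `rate_proxy` function below is a hypothetical stand-in (mean squared one-step change divided by twice the variance), not the estimator used in the paper. The function names and the toy trajectory are likewise assumptions made for illustration.

```python
import math

def rate_proxy(series):
    """Hypothetical stand-in for an effective rate (NOT the paper's estimator):
    mean squared one-step change divided by twice the variance.
    Small values indicate a slowly varying feature."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series) / n
    if var == 0.0:
        return float("inf")  # a constant feature is treated as uninformative
    msd = sum((b - a) ** 2 for a, b in zip(series, series[1:])) / (n - 1)
    return msd / (2.0 * var)

def global_weights(frames):
    """One weight per feature, larger for slower features, normalized to sum to 1."""
    n_feat = len(frames[0])
    rates = [rate_proxy([f[j] for f in frames]) for j in range(n_feat)]
    raw = [0.0 if math.isinf(r) else 1.0 / max(r, 1e-12) for r in rates]
    total = sum(raw)
    if total == 0.0:
        raise ValueError("all features are constant; no weights can be assigned")
    return [w / total for w in raw]

def weighted_distance(a, b, weights):
    """Euclidean distance with per-feature weights."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

# Toy trajectory: feature 0 switches state once (slow),
# feature 1 flips at every frame (fast, noise-like).
frames = [[0.0 if t < 50 else 1.0, float(t % 2)] for t in range(100)]
weights = global_weights(frames)
```

In this toy case the slow feature receives nearly all of the weight, so a difference along feature 0 dominates the distance while the rapidly flipping feature 1 contributes little. A local variant would presumably compute the weights from a window of frames around each snapshot instead of from the whole trajectory; that extension is not shown here.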