Data Science

Background and Motivation

The amount of data that scientists generate and are able to store has increased dramatically in recent years. In many disciplines, it has become increasingly feasible to create meaningful data sets that are very large, both in the number of individual observations and in their dimensionality (for example, video microscopy or climate simulations). Analyzing such data is challenging because it fundamentally demands that the algorithms employed scale well. A second problem is that many scientific data sets capture emergent phenomena, that is, results that are not anticipated beforehand. The analysis must therefore be unsupervised, at least in part. We have long been, and remain, interested in algorithms that are both scalable and unsupervised.

Our Approach

Molecular dynamics simulations are one source of large amounts of data, and initially we were motivated primarily by our own data to come up with creative ways of looking at them. We used (and still use) clustering algorithms to construct network-based visualizations of the free energy landscape in the conformational space of small proteins, as well as so-called cut-based free energy profiles, which are one-dimensional projections containing information about barrier heights. These methods are tailored toward time series, a focus we have maintained in more recent work. Our tree-based clustering algorithm, the progress index method, and its adaptation as a clustering scheme (known as SAPPHIRE-based clustering) are worth mentioning in this regard. We now possess a rich toolbox of scalable and unsupervised data mining tools that we apply to data from completely different domains, most prominently to open data sets in neuroscience, as in recent work by Cocina et al. Our growing interest in data science is also changing the way we think about scoring functions in our drug discovery pipeline.
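
As a rough illustration of how a clustered time series can feed a network-based view of a free energy landscape, the sketch below derives the basic ingredients from a sequence of cluster labels: populations, relative free energies in units of kT (using the standard relation F = -kT ln p), and transition counts between clusters, which can serve as edge weights. This is a minimal sketch under stated assumptions, not our actual implementation; the function name network_from_labels and its parameters are hypothetical, and it does not cover the cut-based profiles or the progress index method.

```python
import numpy as np

def network_from_labels(labels, n_clusters, lag=1):
    """Hypothetical helper: summarize a cluster-label time series.

    Returns (populations, free_energy, transitions), where free_energy is
    in units of kT relative to the most populated cluster, and
    transitions[i, j] counts observed steps from cluster i to cluster j
    at the given lag.
    """
    labels = np.asarray(labels, dtype=int)

    # Node weights: fraction of snapshots assigned to each cluster.
    counts = np.bincount(labels, minlength=n_clusters)
    populations = counts / counts.sum()

    # F_i = -ln(p_i) in units of kT; empty clusters get +inf.
    with np.errstate(divide="ignore"):
        free_energy = -np.log(populations)
    free_energy -= free_energy.min()

    # Edge weights: count transitions i -> j separated by `lag` steps.
    transitions = np.zeros((n_clusters, n_clusters), dtype=int)
    np.add.at(transitions, (labels[:-lag], labels[lag:]), 1)

    return populations, free_energy, transitions


# Toy trajectory with two clusters (made-up labels, for illustration only).
populations, free_energy, transitions = network_from_labels(
    [0, 0, 1, 0, 0, 1, 1, 0], n_clusters=2
)
```

Both steps are linear in the length of the time series, which is the kind of scaling that large trajectories require. The cluster labels themselves are assumed to come from an unsupervised clustering step performed beforehand.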