laboratoire de physique statistique


On the Entropy of Protein Families - Barton, John P. and Chakraborty, Arup K. and Cocco, Simona and Jacquin, Hugo and Monasson, Remi

Abstract : Proteins are essential components of living systems, capable of performing a huge variety of tasks at the molecular level, such as recognition, signalling, copy, transport, ... The protein sequences realizing a given function may largely vary across organisms, giving rise to a protein family. Here, we estimate the entropy of those families based on different approaches, including Hidden Markov Models used for protein databases and inferred statistical models reproducing the low-order (1- and 2-point) statistics of multi-sequence alignments. We also compute the entropic cost, that is, the loss in entropy resulting from a constraint acting on the protein, such as the mutation of one particular amino-acid on a specific site, and relate this notion to the escape probability of the HIV virus. The case of lattice proteins, for which the entropy can be computed exactly, allows us to provide another illustration of the concept of cost, due to the competition of different folds. The relevance of the entropy in relation to directed evolution experiments is stressed.
Benchmarking Inverse Statistical Approaches for Protein Structure and Design with Exactly Solvable Models - Jacquin, Hugo and Gilson, Amy and Shakhnovich, Eugene and Cocco, Simona and Monasson, Remi

Abstract : Inverse statistical approaches to determine protein structure and function from Multiple Sequence Alignments (MSA) are emerging as powerful tools in computational biology. However, the underlying assumptions of the relationship between the inferred effective Potts Hamiltonian and real protein structure and energetics remain untested so far. Here we use lattice protein (LP) models to benchmark those inverse statistical approaches. We build MSA of highly stable sequences in target LP structures, and infer the effective pairwise Potts Hamiltonians from those MSA. We find that inferred Potts Hamiltonians reproduce many important aspects of `true' LP structures and energetics. Careful analysis reveals that effective pairwise couplings in inferred Potts Hamiltonians depend not only on the energetics of the native structure but also on competing folds; in particular, the coupling values reflect both positive design (stabilization of native conformation) and negative design (destabilization of competing folds). In addition to providing detailed structural information, the inferred Potts models used as protein Hamiltonians for the design of new sequences are able to generate with high probability completely new sequences with the desired folds, which is not possible using independent-site models. These are remarkable results, as the effective LP Hamiltonians used to generate MSA are not simple pairwise models due to the competition between the folds. Our findings elucidate the reasons for the success of inverse approaches to the modelling of proteins from sequence data, and their limitations.
ACE: adaptive cluster expansion for maximum entropy graphical model inference - Barton, J. P. and De Leonardis, E. and Coucke, A. and Cocco, S.
BIOINFORMATICS 32, 3089-3097 (2016) 

Abstract : Motivation: Graphical models are often employed to interpret patterns of correlations observed in data through a network of interactions between the variables. Recently, Ising/Potts models, also known as Markov random fields, have been productively applied to diverse problems in biology, including the prediction of structural contacts from protein sequence data and the description of neural activity patterns. However, inference of such models is a challenging computational problem that cannot be solved exactly. Here, we describe the adaptive cluster expansion (ACE) method to quickly and accurately infer Ising or Potts models based on correlation data. ACE avoids overfitting by constructing a sparse network of interactions sufficient to reproduce the observed correlation data within the statistical error expected due to finite sampling. When convergence of the ACE algorithm is slow, we combine it with a Boltzmann Machine Learning algorithm (BML). We illustrate this method on a variety of biological and artificial datasets and compare it to state-of-the-art approximate methods such as Gaussian and pseudo-likelihood inference. Results: We show that ACE accurately reproduces the true parameters of the underlying model when they are known, and yields accurate statistical descriptions of both biological and artificial data. Models inferred by ACE more accurately describe the statistics of the data, including both the constrained low-order correlations and unconstrained higher-order correlations, compared to those obtained by faster Gaussian and pseudo-likelihood methods. These alternative approaches can recover the structure of the interaction network but typically not the correct strength of interactions, resulting in less accurate generative models.
Direct coevolutionary couplings reflect biophysical residue interactions in proteins - Coucke, Alice and Uguzzoni, Guido and Oteri, Francesco and Cocco, Simona and Monasson, Remi and Weigt, Martin

Abstract : Coevolution of residues in contact imposes strong statistical constraints on the sequence variability between homologous proteins. Direct-Coupling Analysis (DCA), a global statistical inference method, successfully models this variability across homologous protein families to infer structural information about proteins. For each residue pair, DCA infers 21 x 21 matrices describing the coevolutionary coupling for each pair of amino acids (or gaps). To achieve the residue-residue contact prediction, these matrices are mapped onto simple scalar parameters; the full information they contain gets lost. Here, we perform a detailed spectral analysis of the coupling matrices resulting from 70 protein families, to show that they contain quantitative information about the physico-chemical properties of amino-acid interactions. Results for protein families are corroborated by the analysis of synthetic data from lattice-protein models, which emphasizes the critical effect of sampling quality and regularization on the biochemical features of the statistical coupling matrices. Published by AIP Publishing.
Neural assemblies revealed by inferred connectivity-based models of prefrontal cortex recordings - Tavoni, G. and Cocco, S. and Monasson, R.

Abstract : We present two graphical model-based approaches to analyse the distribution of neural activities in the prefrontal cortex of behaving rats. The first method aims at identifying cell assemblies, groups of synchronously activating neurons possibly representing the units of neural coding and memory. A graphical (Ising) model distribution of snapshots of the neural activities, with an effective connectivity matrix reproducing the correlation statistics, is inferred from multi-electrode recordings, and then simulated in the presence of a virtual external drive, favoring high activity (multi-neuron) configurations. As the drive increases, groups of neurons may activate together, and reveal the existence of cell assemblies. The identified groups are then shown to strongly coactivate in the neural spiking data and to be highly specific to the inferred connectivity network, which offers a sparse representation of the correlation pattern across neural cells. The second method relies on the inference of a Generalized Linear Model (GLM), in which spiking events are integrated over time by neurons through an effective connectivity matrix. The functional connectivity matrices inferred with the two approaches are compared. Sampling of the inferred GLM distribution allows us to study the spatio-temporal patterns of activation of neurons within the identified cell assemblies, particularly their activation order: the prevalence of one order with respect to the others is weak and reflects the neuron average firing rates and the strength of the largest effective connections. Other properties of the identified cell assemblies (spatial distribution of coactivation events and firing rates of coactivating neurons) are discussed.
Protein and RNA Structure Prediction by Integration of Co-Evolutionary Information into Molecular Simulation - De Leonardis, Eleonora and Lutz, Benjamin and Cocco, Simona and Monasson, Remi and Szurmant, Hendrik and Weigt, Martin and Schug, Alexander
Learning Probabilities From Random Observables in High Dimensions: The Maximum Entropy Distribution and Others - Obuchi, Tomoyuki and Cocco, Simona and Monasson, Remi

Abstract : We consider the problem of learning a target probability distribution over a set of N binary variables from the knowledge of the expectation values (with this target distribution) of M observables, drawn uniformly at random. The space of all probability distributions compatible with these M expectation values within some fixed accuracy, called version space, is studied. We introduce a biased measure over the version space, which gives a boost increasing exponentially with the entropy of the distributions and with an arbitrary inverse `temperature' Γ. The choice of Γ allows us to interpolate smoothly between the unbiased measure over all distributions in the version space (Γ = 0) and the pointwise measure concentrated at the maximum entropy distribution (Γ → ∞). Using the replica method we compute the volume of the version space and other quantities of interest, such as the distance R between the target distribution and the center-of-mass distribution over the version space, as functions of α = (log M)/N and Γ for large N. Phase transitions at critical values of α are found, corresponding to qualitative improvements in the learning of the target distribution and to the decrease of the distance R. However, for fixed α, the distance R does not vary with Γ, which means that the maximum entropy distribution is not closer to the target distribution than any other distribution compatible with the observable values. Our results are confirmed by Monte Carlo sampling of the version space for small system sizes (N ≤ 10).
Direct-Coupling Analysis of nucleotide coevolution facilitates RNA secondary and tertiary structure prediction - De Leonardis, Eleonora and Lutz, Benjamin and Ratz, Sebastian and Cocco, Simona and Monasson, Remi and Schug, Alexander and Weigt, Martin
NUCLEIC ACIDS RESEARCH 43, 10444-10455 (2015) 

Abstract : Despite the biological importance of non-coding RNA, their structural characterization remains challenging. Making use of the rapidly growing sequence databases, we analyze nucleotide coevolution across homologous sequences via Direct-Coupling Analysis to detect nucleotide-nucleotide contacts. For a representative set of riboswitches, we show that the results of Direct-Coupling Analysis in combination with a generalized Nussinov algorithm systematically improve the results of RNA secondary structure prediction beyond traditional covariance approaches based on mutual information. Even more importantly, we show that the results of Direct-Coupling Analysis are enriched in tertiary structure contacts. By integrating these predictions into molecular modeling tools, systematically improved tertiary structure predictions can be obtained, as compared to using secondary structure information alone.
Distinguishing the immunostimulatory properties of noncoding RNAs expressed in cancer cells - Tanne, Antoine and Muniz, Luciana R. and Puzio-Kuter, Anna and Leonova, Katerina I. and Gudkov, Andrei V. and Ting, David T. and Monasson, Remi and Cocco, Simona and Levine, Arnold J. and Bhardwaj, Nina and Greenbaum, Benjamin D.

Abstract : Recent studies have demonstrated abundant transcription of a set of noncoding RNAs (ncRNAs) preferentially within tumors as opposed to normal tissue. Using an approach from statistical physics, we quantify global transcriptome-wide motif use for the first time, to our knowledge, in human and murine ncRNAs, determining that most have motif use consistent with the coding genome. However, an outlier subset of tumor-associated ncRNAs, typically of recent evolutionary origin, has motif use that is often indicative of pathogen-associated RNA. For instance, we show that the tumor-associated human repeat human satellite repeat II (HSATII) is enriched in motifs containing CpG dinucleotides in AU-rich contexts that most of the human genome and human adapted viruses have evolved to avoid. We demonstrate that a key subset of these ncRNAs functions as immunostimulatory ``self-agonists'' and directly activates cells of the mononuclear phagocytic system to produce proinflammatory cytokines. These ncRNAs arise from endogenous repetitive elements that are normally silenced, yet are often very highly expressed in cancers. We propose that the innate response in tumors may partially originate from direct interaction of immunogenic ncRNAs expressed in cancer cells with innate pattern recognition receptors, and thereby assign a previously unidentified danger-associated function to a set of dark matter repetitive elements. These findings potentially reconcile several observations concerning the role of ncRNA expression in cancers and their relationship to the tumor microenvironment.
Reconstruction and Identification of DNA Sequence Landscapes from Unzipping Experiments at Equilibrium - Barbieri, Carlo and Cocco, Simona and Jorg, Thomas and Monasson, Remi
BIOPHYSICAL JOURNAL 106, 430-439 (2014) 

Abstract : Two methods for reconstructing the free-energy landscape of a DNA molecule from the knowledge of the equilibrium unzipping force versus extension signal are introduced: a simple and fast procedure, based on a parametric representation of the experimental force signal, and a maximum-likelihood inference of coarse-grained free-energy parameters. In addition, we propose a force alignment procedure to correct for the drift in the experimental measure of the opening position, a major source of error. For unzipping data obtained by Huguet et al., the reconstructed basepair (bp) free energies agree with the running average of the true free energies on a 20-50 bp scale, depending on the region in the sequence. Features of the landscape at a smaller scale (5-10 bp) could be recovered in favorable regions at the beginning of the molecule. Based on the analysis of synthetic data corresponding to the 16S rDNA gene of bacteria, we show that our approach could be used to identify specific DNA sequences among thousands of homologous sequences in a database.
Quantitative theory of entropic forces acting on constrained nucleotide sequences applied to viruses - Greenbaum, Benjamin D. and Cocco, Simona and Levine, Arnold J. and Monasson, Remi

Abstract : We outline a theory to quantify the interplay of entropic and selective forces on nucleotide organization and apply it to the genomes of single-stranded RNA viruses. We quantify these forces as intensive variables that can easily be compared between sequences, outline a computationally efficient transfer-matrix method for their calculation, and apply this method to influenza and HIV viruses. We find viruses altering their dinucleotide motif use under selective forces, with these forces on CpG dinucleotides growing stronger in influenza the longer it replicates in humans. For a subset of genes in the human genome, many involved in antiviral innate immunity, the forces acting on CpG dinucleotides are even greater than the forces observed in viruses, suggesting that both effects are in response to similar selective forces involving the innate immune system. We further find that the dynamics of entropic forces balancing selective forces can be used to predict how long it will take a virus to adapt to a new host, and that it would take H1N1 several centuries to adapt to humans from birds, typically contributing many of its synonymous substitutions to the forcible removal of CpG dinucleotides. By examining the probability landscape of dinucleotide motifs, we predict where motifs are likely to appear using only a single-force parameter and uncover the localization of UpU motifs in HIV. Essentially, we extend the natural language and concepts of statistical physics, such as entropy and conjugated forces, to understanding viral sequences and, more generally, constrained genome evolution.
Stochastic Ratchet Mechanisms for Replacement of Proteins Bound to DNA - Cocco, S. and Marko, J. F. and Monasson, R.

Abstract : Experiments indicate that unbinding rates of proteins from DNA can depend on the concentration of proteins in nearby solution. Here we present a theory of multistep replacement of DNA-bound proteins by solution-phase proteins. For four different kinetic scenarios we calculate the dependence of protein unbinding and replacement rates on solution protein concentration. We find (1) strong effects of progressive ``rezipping'' of the solution-phase protein onto DNA sites liberated by ``unzipping'' of the originally bound protein, (2) that a model in which solution-phase proteins bind nonspecifically to DNA can describe experiments on exchanges between the nonspecific DNA-binding proteins Fis-Fis and Fis-HU, and (3) that a binding specific model describes experiments on the exchange of CueR proteins on specific binding sites.
Large pseudocounts and L2-norm penalties are necessary for the mean-field inference of Ising and Potts models - Barton, J. P. and Cocco, S. and De Leonardis, E. and Monasson, R.

Abstract : The mean-field (MF) approximation offers a simple, fast way to infer direct interactions between elements in a network of correlated variables, a common, computationally challenging problem with practical applications in fields ranging from physics and biology to the social sciences. However, MF methods achieve their best performance with strong regularization, well beyond Bayesian expectations, an empirical fact that is poorly understood. In this work, we study the influence of pseudocount and L2-norm regularization schemes on the quality of inferred Ising or Potts interaction networks from correlation data within the MF approximation. We argue, based on the analysis of small systems, that the optimal value of the regularization strength remains finite even if the sampling noise tends to zero, in order to correct for systematic biases introduced by the MF approximation. Our claim is corroborated by extensive numerical studies of diverse model systems and by the analytical study of the m-component spin model for large but finite m. Additionally, we find that pseudocount regularization is robust against sampling noise and often outperforms L2-norm regularization, particularly when the underlying network of interactions is strongly heterogeneous. Much better performance is generally obtained for the Ising model than for the Potts model, for which only couplings incoming onto medium-frequency symbols are reliably inferred.
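
As an illustration of the kind of inference discussed in the abstract above, here is a minimal sketch of naive mean-field inference for an Ising model with a pseudocount. It is not the authors' code. It assumes 0/1 variables, an energy of the form E = -sum_i h_i s_i - sum_{i<j} J_ij s_i s_j, and a pseudocount that mixes the empirical frequencies with those of the uniform distribution; the function name `mean_field_ising` and the default `pseudocount` value are hypothetical choices made for this example.

```python
import numpy as np

def mean_field_ising(samples, pseudocount=0.5):
    """Naive mean-field estimate of Ising couplings J and fields h.

    samples: array of shape (B, N) with entries 0 or 1.
    pseudocount: weight in (0, 1] given to the uniform distribution
                 (an assumed regularization scheme, see lead-in).
    """
    x = np.asarray(samples, dtype=float)
    n_samples, _ = x.shape

    # Empirical one- and two-point frequencies.
    p = x.mean(axis=0)
    pij = x.T @ x / n_samples

    # Pseudocount: mix with the uniform distribution over 0/1 configurations,
    # whose one- and two-point frequencies are 1/2 and 1/4.
    a = pseudocount
    p = (1 - a) * p + a / 2
    pij = (1 - a) * pij + a / 4
    np.fill_diagonal(pij, p)  # for binary variables p_ii = p_i

    # Connected correlation matrix.
    c = pij - np.outer(p, p)

    # Mean-field couplings: minus the off-diagonal part of the inverse of C.
    j = -np.linalg.inv(c)
    np.fill_diagonal(j, 0.0)

    # Fields from the mean-field self-consistency equation
    # p_i = sigmoid(h_i + sum_j J_ij p_j).
    h = np.log(p / (1 - p)) - j @ p
    return j, h
```

With a strictly positive pseudocount the regularized correlation matrix is that of a distribution with full support, so it is invertible even when the empirical one is singular, for instance when two variables are perfectly correlated in the sample.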
Ising models for neural activity inferred via selective cluster expansion: structural and coding properties - Barton, John and Cocco, Simona

Abstract : We describe the selective cluster expansion (SCE) of the entropy, a method for inferring an Ising model which describes the correlated activity of populations of neurons. We re-analyze data obtained from multielectrode recordings performed in vitro on the retina and in vivo on the prefrontal cortex. Recorded population sizes N range from N = 37 to 117 neurons. We compare the SCE method with the simplest mean field methods (corresponding to a Gaussian model) and with regularizations which favor sparse networks (L1 norm) or penalize large couplings (L2 norm). The network of the strongest interactions inferred via mean field methods generally agrees with that obtained from SCE. Reconstruction of the sampled moments of the distributions, corresponding to neuron spiking frequencies and pairwise correlations, and the prediction of higher moments including three-cell correlations and multi-neuron firing frequencies, is more difficult than determining the large-scale structure of the interaction network, and, apart from a cortical recording in which the measured correlation indices are small, these goals are achieved with the SCE but not with mean field approaches. We also find differences in the inferred structure of retinal and cortical networks: inferred interactions tend to be more irregular and sparse for cortical data than for retinal data. This result may reflect the structure of the recording. As a consequence, the SCE is more effective for retinal data when expanding the entropy with respect to a mean field reference, S - S_MF, while expansions of the entropy S alone perform better for cortical data.
From Principal Component to Direct Coupling Analysis of Coevolution in Proteins: Low-Eigenvalue Modes are Needed for Structure Prediction - Cocco, Simona and Monasson, Remi and Weigt, Martin

Abstract : Various approaches have explored the covariation of residues in multiple-sequence alignments of homologous proteins to extract functional and structural information. Among those are principal component analysis (PCA), which identifies the most correlated groups of residues, and direct coupling analysis (DCA), a global inference method based on the maximum entropy principle, which aims at predicting residue-residue contacts. In this paper, inspired by the statistical physics of disordered systems, we introduce the Hopfield-Potts model to naturally interpolate between these two approaches. The Hopfield-Potts model allows us to identify relevant `patterns' of residues from the knowledge of the eigenmodes and eigenvalues of the residue-residue correlation matrix. We show how the computation of such statistical patterns makes it possible to accurately predict residue-residue contacts with a much smaller number of parameters than DCA. This dimensional reduction allows us to avoid overfitting and to extract contact information from multiple-sequence alignments of reduced size. In addition, we show that low-eigenvalue correlation modes, discarded by PCA, are important to recover structural information: the corresponding patterns are highly localized, that is, they are concentrated in few sites, which we find to be in close contact in the three-dimensional protein fold.
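
The notion of a "localized" correlation mode in the abstract above can be made concrete with a short sketch. It is not the Hopfield-Potts procedure of the paper. It assumes real-valued or binary data rather than amino-acid sequences, and it uses the inverse participation ratio as the localization measure, which is a choice made for this example; the function names are hypothetical.

```python
import numpy as np

def correlation_modes(data):
    """Eigenvalues (ascending) and eigenvectors of the Pearson correlation matrix.

    data: array of shape (B, N), one row per sample, one column per variable.
    Every column is assumed to have non-zero variance.
    """
    x = np.asarray(data, dtype=float)
    corr = np.corrcoef(x, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # columns of eigvecs are the modes
    return eigvals, eigvecs

def inverse_participation_ratio(v):
    """Sum of the fourth powers of the components of the normalized vector v.

    The value is 1 for a mode concentrated on one site and 1/N for a mode
    spread evenly over N sites.
    """
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)
    return float(np.sum(v ** 4))
```

Since `numpy.linalg.eigh` returns eigenvalues in ascending order, the first columns of `eigvecs` are the low-eigenvalue modes that PCA would discard, and their inverse participation ratios can be compared with those of the last columns.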
Trend and fluctuations: Analysis and design of population dynamics measurements in replicate ecosystems - Hekstra, Doeke R. and Cocco, Simona and Monasson, Remi and Leibler, Stanislas

Abstract : The dynamical evolution of complex systems is often intrinsically stochastic and subject to external random forces. In order to study the resulting variability in dynamics, it is essential to make measurements on replicate systems and to separate arbitrary variation of the average dynamics of these replicates from fluctuations around the average dynamics. Here we do so for population time-series data from replicate ecosystems sharing a common average dynamics or common trend. We explain how model parameters, including the effective interactions between species and dynamical noise, can be estimated from the data and how replication reduces errors in these estimates. For this, it is essential that the model can fit a variety of average dynamics. We then show how one can judge the quality of a model, compare alternate models, and determine which combinations of parameters are poorly determined by the data. In addition we show how replicate population dynamics experiments could be designed to optimize the acquired information of interest about the systems. Our approach is illustrated on a set of time series gathered from replicate microbial closed ecosystems.
Adaptive Cluster Expansion for the Inverse Ising Problem: Convergence, Algorithm and Tests - Cocco, S. and Monasson, R.

Abstract : We present a procedure to solve the inverse Ising problem, that is, to find the interactions between a set of binary variables from the measurement of their equilibrium correlations. The method consists in constructing and selecting specific clusters of spins, based on their contributions to the cross-entropy of the Ising model. Small contributions are discarded to avoid overfitting and to make the computation tractable. The properties of the cluster expansion and its performance on synthetic data are studied. To make the implementation easier we give the pseudo-code of the algorithm.
Adaptive Cluster Expansion for Inferring Boltzmann Machines with Noisy Data - Cocco, S. and Monasson, R.

Abstract : We introduce a procedure to infer the interactions among a set of binary variables, based on their sampled frequencies and pairwise correlations. The algorithm builds the clusters of variables contributing most to the entropy of the inferred Ising model and rejects the small contributions due to the sampling noise. Our procedure successfully recovers benchmark Ising models even at criticality and in the low temperature phase, and is applied to neurobiological data.
On the trajectories and performance of Infotaxis, an information-based greedy search algorithm - Barbieri, C. and Cocco, S. and Monasson, R.
EPL 94 (2011) 

Abstract : We present a continuous-space version of Infotaxis, a search algorithm where a searcher greedily moves to maximize the gain in information about the position of the target to be found. Using a combination of analytical and numerical tools we study the nature of the trajectories in two and three dimensions. The probability that the search is successful and the running time of the search are estimated. A possible extension to non-greedy search is suggested. Copyright (C) EPLA, 2011
High-dimensional inference with the generalized Hopfield model: Principal component analysis and corrections - Cocco, S. and Monasson, R. and Sessak, V.

Abstract : We consider the problem of inferring the interactions between a set of N binary variables from the knowledge of their frequencies and pairwise correlations. The inference framework is based on the Hopfield model, a special case of the Ising model where the interaction matrix is defined through a set of patterns in the variable space, and is of rank much smaller than N. We show that maximum likelihood inference is deeply related to principal component analysis when the amplitude of the pattern components ξ is negligible compared to √N. Using techniques from statistical mechanics, we calculate the corrections to the patterns to the first order in ξ/√N. We stress the need to generalize the Hopfield model and include both attractive and repulsive patterns in order to correctly infer networks with sparse and strong interactions. We present a simple geometrical criterion to decide how many attractive and repulsive patterns should be considered as a function of the sampling noise. We moreover discuss how many sampled configurations are required for a good inference, as a function of the system size N and of the amplitude ξ. The inference approach is illustrated on synthetic and biological data.