DOI

1

repgenHMM: a dynamic programming tool to infer the rules of immune receptor generation from sequence data - Elhanati, Yuval and Marcou, Quentin and Mora, Thierry and Walczak, Aleksandra M.

BIOINFORMATICS 32, 1943-1951 (2016)

Abstract : Motivation: The diversity of the immune repertoire is initially generated by random rearrangements of the receptor gene during early T and B cell development. Rearrangement scenarios are composed of random events-choices of gene templates, base pair deletions and insertions-described by probability distributions. Not all scenarios are equally likely, and the same receptor sequence may be obtained in several different ways. Quantifying the distribution of these rearrangements is an essential baseline for studying the immune system diversity. Inferring the properties of the distributions from receptor sequences is a computationally hard problem, requiring enumerating every possible scenario for every sampled receptor sequence. Results: We present a Hidden Markov model, which accounts for all plausible scenarios that can generate the receptor sequences. We developed and implemented a method based on the Baum-Welch algorithm that can efficiently infer the parameters for the different events of the rearrangement process. We tested our software tool on sequence data for both the alpha and beta chains of the T cell receptor. To test the validity of our algorithm, we also generated synthetic sequences produced by a known model, and confirmed that its parameters could be accurately inferred back from the sequences. The inferred model can be used to generate synthetic sequences, to calculate the probability of generation of any receptor sequence, as well as the theoretical diversity of the repertoire. We estimate this diversity to be approximate to 10(23) for human T cells. The model gives a baseline to investigate the selection and dynamics of immune repertoires.

BIOINFORMATICS 32, 1943-1951 (2016)

Abstract : Motivation: The diversity of the immune repertoire is initially generated by random rearrangements of the receptor gene during early T and B cell development. Rearrangement scenarios are composed of random events-choices of gene templates, base pair deletions and insertions-described by probability distributions. Not all scenarios are equally likely, and the same receptor sequence may be obtained in several different ways. Quantifying the distribution of these rearrangements is an essential baseline for studying the immune system diversity. Inferring the properties of the distributions from receptor sequences is a computationally hard problem, requiring enumerating every possible scenario for every sampled receptor sequence. Results: We present a Hidden Markov model, which accounts for all plausible scenarios that can generate the receptor sequences. We developed and implemented a method based on the Baum-Welch algorithm that can efficiently infer the parameters for the different events of the rearrangement process. We tested our software tool on sequence data for both the alpha and beta chains of the T cell receptor. To test the validity of our algorithm, we also generated synthetic sequences produced by a known model, and confirmed that its parameters could be accurately inferred back from the sequences. The inferred model can be used to generate synthetic sequences, to calculate the probability of generation of any receptor sequence, as well as the theoretical diversity of the repertoire. We estimate this diversity to be approximate to 10(23) for human T cells. The model gives a baseline to investigate the selection and dynamics of immune repertoires.

DOI

2

ACE: adaptive cluster expansion for maximum entropy graphical model inference - Barton, J. P. and De Leonardis, E. and Coucke, A. and Cocco, S.

BIOINFORMATICS 32, 3089-3097 (2016)

Abstract : Motivation: Graphical models are often employed to interpret patterns of correlations observed in data through a network of interactions between the variables. Recently, Ising/Potts models, also known as Markov random fields, have been productively applied to diverse problems in biology, including the prediction of structural contacts from protein sequence data and the description of neural activity patterns. However, inference of such models is a challenging computational problem that cannot be solved exactly. Here, we describe the adaptive cluster expansion (ACE) method to quickly and accurately infer Ising or Potts models based on correlation data. ACE avoids overfitting by constructing a sparse network of interactions sufficient to reproduce the observed correlation data within the statistical error expected due to finite sampling. When convergence of the ACE algorithm is slow, we combine it with a Boltzmann Machine Learning algorithm (BML). We illustrate this method on a variety of biological and artificial datasets and compare it to state-of-the-art approximate methods such as Gaussian and pseudo-likelihood inference. Results: We show that ACE accurately reproduces the true parameters of the underlying model when they are known, and yields accurate statistical descriptions of both biological and artificial data. Models inferred by ACE more accurately describe the statistics of the data, including both the constrained low-order correlations and unconstrained higher-order correlations, compared to those obtained by faster Gaussian and pseudo-likelihood methods. These alternative approaches can recover the structure of the interaction network but typically not the correct strength of interactions, resulting in less accurate generative models.

BIOINFORMATICS 32, 3089-3097 (2016)

Abstract : Motivation: Graphical models are often employed to interpret patterns of correlations observed in data through a network of interactions between the variables. Recently, Ising/Potts models, also known as Markov random fields, have been productively applied to diverse problems in biology, including the prediction of structural contacts from protein sequence data and the description of neural activity patterns. However, inference of such models is a challenging computational problem that cannot be solved exactly. Here, we describe the adaptive cluster expansion (ACE) method to quickly and accurately infer Ising or Potts models based on correlation data. ACE avoids overfitting by constructing a sparse network of interactions sufficient to reproduce the observed correlation data within the statistical error expected due to finite sampling. When convergence of the ACE algorithm is slow, we combine it with a Boltzmann Machine Learning algorithm (BML). We illustrate this method on a variety of biological and artificial datasets and compare it to state-of-the-art approximate methods such as Gaussian and pseudo-likelihood inference. Results: We show that ACE accurately reproduces the true parameters of the underlying model when they are known, and yields accurate statistical descriptions of both biological and artificial data. Models inferred by ACE more accurately describe the statistics of the data, including both the constrained low-order correlations and unconstrained higher-order correlations, compared to those obtained by faster Gaussian and pseudo-likelihood methods. These alternative approaches can recover the structure of the interaction network but typically not the correct strength of interactions, resulting in less accurate generative models.

DOI

3

Capturing coevolutionary signals in repeat proteins - Espada, Rocio and Parra, R. Gonzalo and Mora, Thierry and Walczak, Aleksandra M. and Ferreiro, Diego U.

BMC BIOINFORMATICS 16, (2015)

Abstract : Background: The analysis of correlations of amino acid occurrences in globular domains has led to the development of statistical tools that can identify native contacts - portions of the chains that come to close distance in folded structural ensembles. Here we introduce a direct coupling analysis for repeat proteins - natural systems for which the identification of folding domains remains challenging. Results: We show that the inherent translational symmetry of repeat protein sequences introduces a strong bias in the pair correlations at precisely the length scale of the repeat-unit. Equalizing for this bias in an objective way reveals true co-evolutionary signals from which local native contacts can be identified. Importantly, parameter values obtained for all other interactions are not significantly affected by the equalization. We quantify the robustness of the procedure and assign confidence levels to the interactions, identifying the minimum number of sequences needed to extract evolutionary information in several repeat protein families. Conclusions: The overall procedure can be used to reconstruct the interactions at distances larger than repeat-pairs, identifying the characteristics of the strongest couplings in each family, and can be applied to any system that appears translationally symmetric.

BMC BIOINFORMATICS 16, (2015)

Abstract : Background: The analysis of correlations of amino acid occurrences in globular domains has led to the development of statistical tools that can identify native contacts - portions of the chains that come to close distance in folded structural ensembles. Here we introduce a direct coupling analysis for repeat proteins - natural systems for which the identification of folding domains remains challenging. Results: We show that the inherent translational symmetry of repeat protein sequences introduces a strong bias in the pair correlations at precisely the length scale of the repeat-unit. Equalizing for this bias in an objective way reveals true co-evolutionary signals from which local native contacts can be identified. Importantly, parameter values obtained for all other interactions are not significantly affected by the equalization. We quantify the robustness of the procedure and assign confidence levels to the interactions, identifying the minimum number of sequences needed to extract evolutionary information in several repeat protein families. Conclusions: The overall procedure can be used to reconstruct the interactions at distances larger than repeat-pairs, identifying the characteristics of the strongest couplings in each family, and can be applied to any system that appears translationally symmetric.