Co-Variation Approaches to the Evolution of Protein Families

Co-variation in amino acid distribution of any two positions in the multiple sequence alignment (MSA) of a protein family is widely used to obtain structural and/or functional information [1]. It is assumed that, after mutation of one residue, structural/functional constraints can require a compensatory mutation of another residue to maintain the structure or to restore the function of the corresponding protein. Such compensatory mutations can depend on direct physical interactions between two residues or on an indirect interaction through intermediary residues or a ligand [2].


Introduction
Co-variation in amino acid distribution of any two positions in the multiple sequence alignment (MSA) of a protein family is widely used to obtain structural and/or functional information [1]. It is assumed that, after mutation of one residue, structural/functional constraints can require a compensatory mutation of another residue to maintain the structure or to restore the function of the corresponding protein. Such compensatory mutations can depend on direct physical interactions between two residues or on an indirect interaction through intermediary residues or a ligand [2].
The analysis of sequence co-variation in the MSA of a protein family is not straightforward because the sequences are not independent but phylogenetically related. In the MSA of a protein family containing several sub-families, sub-family independent co-varying positions are assumed to correspond to structurally important compensatory mutations. Subfamily dependent co-varying positions may arise from either compensatory mutations within one sub-family or independent mutations in several subfamilies. Both of these mechanisms lead to a phylogenetic bias.
Until now, the goal of most co-variation studies has been to predict contacts in order to gain structural information from co-evolving positions. To achieve this goal, different methods attempt to remove or reduce the phylogenetic bias [3][4][5] and to differentiate direct coupling from indirect coupling which is created by the transitivity effect of pair co-variation [2]. To this effect, Bayesian network [2], maximum entropy [6] and neural network [7] models markedly improved the prediction of interacting pairs, with successful de novo 3D protein structure prediction. Recently, co-evolution based methods coupled to machine learning have significantly improved contact prediction and constitute an area of intense research for de novo structure prediction [8]. The analysis of co-varying residues in an MSA has also been used to identify specificity-determining residues [9][10][11][12], protein sectors [13] or connectivity pathways [14][15][16].
Most comparative studies of co-variation methods [3,4,17] aim to find methods that optimize the number of contact pairs to gain structural information. Co-variation approaches that aim to gain information on protein divergence have received less attention. However, these approaches can be very useful to gain information on protein evolution and sub-family specificity. We thus undertook a comparative analysis of widely used co-variation methods on a model system to determine methods best suited to this aim [18].

Characteristics of the model system
To carry out this analysis, we chose the large family of rhodopsin-like G protein coupled receptors as a model system [18]. The human nonolfactory receptor set includes 283 receptors that can be classified into a dozen of sub-families [19] corresponding to three main evolutionary pathways [20]. The median sequence identity of human receptors varies from 20% in the full set to about 30% within sub-families (transmembrane domain only). We compared the different methods on three nested sequence sets. The sets correspond to (1) human non olfactory receptors, (2) receptors characterized by the P2.58 proline pattern (107 members, with 26% sequence identity). The latter receptors diverged after an indel in helix 2 which is one of the main evolutionary pathways of GPCRs [21] and (3) the chemokine receptors (23 members with 37% sequence identity). The characteristics of these three sets are frequently found in a variety of protein families.

Sequence co-variation methods
Most methods that measure co-variation in the amino acid distribution at any two positions in an MSA can be classified into four main classes: (1) Methods based on the χ 2 test, such as OMES (Observed minus Expected Squared) [17]. The OMES score is computed as: (2) Methods based on mutual information (MI), the probability of joint occurrence of events [22]. The MI score is given by: and p x, y (i, j) are the frequencies of amino acids x at position i, y at position j and of the pair (x,y) at positions (i,j). As mutual information formula favors pairs with high entropy, several corrections have been applied to correct this bias. The MIp score (mutual information product) outperforms other MI based methods [3] and is computed as: MI i j and MI < > are the average of MI(i,j) on, respectively, i, j and both i and j.
(3) Methods based on substitution matrices such as McBASC (McLachlan Based Substitution Correlation method) [23]. The McBASC score is computed as: (4) Methods using perturbation of an MSA, such as the statistical coupling analysis (SCA) [16] and Explicit Likelihood of Subset Covariation (ELSC) [25] methods. These latter two methods compare amino acid composition in a subset to the composition in the entire alignment. The SCA score is given by: The ELSC score measures how many possible subsets of size n would have the composition found in column j. It is computed as: is the binomial coefficient and N y (j), n y (j) and m y (j) are, respectively, the numbers of residues y at position j in the total (unperturbed) sequence alignment, in the subset alignment defined by the perturbation in column i and in the ideal subset (i.e., in a subset with the amino acid distribution equal to the total alignment).

Entropy biases
To compare the different methods [18], we analyzed co-variation scores as a function of sequence entropy which is a measure of variability based on the Shannon entropy [26] and is given by:  strong bias towards pairs with at least one highly variable position (blue dots on right top corner or along right side of the plot) and should thus be considered with caution. The other four methods favor pairs with intermediary conservation (blue dots in the center of the plots). However, with OMES and ELSC, the entropy of the top pairs is clearly separated from the entropy of the bottom pairs (red dots), which is not the case for the MIp and McBASC methods with more fuzzy plots. The plots in Figure 1 clearly indicate that each method favor pairs with different levels of sequence conservation. This entropy bias leads to different results for each method. Consequently, this bias must be taken into account and carefully analyzed upon selection of a method (to do so, we developed the R package bios2cor which includes a function for this bi-dimensional analysis [27]).

Networking biases
The network representation of the top pairs obtained with the OMES, ELSC, MIp and McBASC methods for Set 1 (Figure 2) highlights another striking difference between methods. MIp and McBASC methods favor pairs with low connectivity whereas the networks obtained with OMES and ELSC have a hub structure with a highly connected central residue. Most importantly, the central residue (P2.58) obtained with the OMES and ELSC methods is an important determinant of GPCR evolution [20,21]. This networking structure is observed for the three nested sequence sets that we have analyzed [18]. In each case, the central residue corresponds to a residue known to be crucial for the evolution of the GPCR family and the sub-family specificity (divergence of purinergic receptors in the set of P2.58 receptors and of inflammatory receptors in the chemokine receptor set). The hub structure observed for GPCRs strongly supports an epistasis model of protein evolution in which after mutation of a key residue, co-evolution of several residues was necessary to restore/shift protein function.

Selection of a co-variation method
The performance of co-variation analysis to answer specific questions crucially depends on (1) the co-variation method used and (2) the characteristics of the MSA (e.g., number of sequences, homogeneity and conservation) [3,4,17]. The first step in a co-variation analysis is thus to determine the specific questions of interest and then, to select a suited co-variation method and to prepare a sequence set accordingly.
When co-variation analysis aims to gain structural information, the methods favoring pairs with low connectivity are suited because these pairs are enriched in contact pairs [3,4]. This is the case for McBASC and MIp (Figure 2). This latter method was specifically developed with this aim. These methods require large sequence sets, with at least 100 sequences, as homogeneous as possible [3,17]. This last requirement should prompt to build MSA of orthologs whenever possible.
When co-variation analysis aims to gain evolutionary information, OMES and ELSC are well suited to identify the co-evolving residues that contributed to the divergence within a protein family. These methods favor pairs with intermediate conservation and highly connected residues and can work with a very small number of sequences (<25). They require a sequence set with two subsets of similar size and sequence similarity. Size requirement can be fulfilled by tailoring sequence sets with judicious use of paralogs and orthologs [18]. Homogeneity of the subsets is mandatory to avoid bias toward an overwhelming or a highly conserved sub-family.

Conclusion
The phylogenetic bias, related to the intrinsic inhomogeneity of a sequence set containing sub-families, represents a rich source of information on the mechanisms that drove the evolution of a protein family. Two co-variation methods, OMES and ELSC, are adequate to Positions correspond to nodes and co-variation signals to edges. The size of the nodes is proportional to their connectivity. The color indicates the entropy of the position from black for a fully conserved position to white for the most variable position. Node labels indicate the position with Ballesteros' numbering [29]. Dotted edges indicate distance below 8 Å. Adapted from [18]. mine evolutionary hubs, in relation with an epistasis-based mechanism of functional divergence within a protein family [28].

Software availability
The R package bios2cor is freely accessible at the Comprehensive R Archive Network (http://cran.r-project.org). It includes R-coded OMES, ELSC, MIp and McBASC functions as well as representation tools, such as the 2D entropy-score plots.