rehh 2.0: a reimplementation of the R package rehh to detect positive selection from haplotype structure.

Abstract


Introduction
Next-generation sequencing (NGS) technologies have deeply transformed the nature of polymorphism data.Although population geneticists were, until recently, limited by the amount of available data in a handful of presumably independent markers, they now have access to dense single nucleotide polymorphism (SNP) data in both model and non-model species (Davey et al , 2011).In those species where genome assemblies are available, the analysis of haplotype structure in a population has proved useful to detect recent positive selection (Sabeti et al , 2002).Consider neutral mutations appearing in a population: if, by chance, any of these increases in frequency after some time, then recombination should tend to break down linkage disequilibrium (LD) around it, thereby decreasing the length of haplotypes on which this mutation stands.Common variants are therefore expected to be old and standing on short haplotypes.If a mutation is selected for, however, it should expand in the population before recombination has time to break down the haplotype on which it occurred.A powerful strategy to characterize candidate genes responding to natural or artificial positive selection thus consists in identifying genomic regions with unusually high local haplotype homozygosity, relatively to neutral expectation (Sabeti et al , 2002).
For that purpose, Sabeti et al (2002) introduced a new metric, referred to as the extended haplotype homozygosity (EHH), which measures the decay of identity by descent, as function of distance, between randomly sampled chromosomes carrying a focal SNP.Tests of departure of EHH from neutral expectation were proposed, based on coalescent simulations of demographic history.Voight et al (2006) later introduced a test statistic (iHS) based on the standardized log-ratio of the integrals of the observed decay of EHH computed for the ancestral and the derived alleles at the focal SNP.Finally, cross-population statistics were proposed, to contrast EHH profiles between populations: XP-EHH (Sabeti et al , 2007) and Rsb (Tang et al , 2007).These haplotype-based methods of detecting selection have largely been applied on human data (Vitti et al , 2013), a wide range of livestock (see, e.g.Flori et al , 2014;Bosse et al , 2015;Barson et al , 2015) and plant species (see, e.g.Wang et al , 2014;Jin et al , 2016), and also non-model species (see, e.g.Roesti et al , 2015;Mueller et al , 2016).
A few years ago, we developed rehh (Gautier & Vitalis, 2012), a package for the statistical software package R ( R Development Core Team, 2008), to detect recent positive selection from the analysis of longrange haplotypes.Since then, two alternative programs were released: selscan (Szpiech & Hernandez, peer-reviewed) is the author/funder.All rights reserved.No reuse allowed without permission.
The copyright holder for this preprint (which was not .http://dx.doi.org/10.1101/067629doi: bioRxiv preprint first posted online Aug. 3, 2016; 2014), which introduces multithreading to improve computational efficiency and hapbin (Maclean et al , 2015), which in addition to multithreading offers considerable gain in computation time thanks to a new computational approach based on a bitwise algorithm.
Here we propose a major upgrade of the rehh package (Gautier & Vitalis, 2012), which includes an improved algorithm to enumerate haplotypes, as well as multi-threading.These improvements decrease the computation time by more than an order of magnitude, as compared to the previous rehh version (1.13), which eases the analysis of big datasets.
Below we provide a brief overview of the statistics and tests available in rehh 2.0, and give a detailed worked example of the analysis of chromosome 2 in humans (HSA2), from HapMap samples CEU (Utah residents with Northern and Western European ancestry from the CEPH collection) and JPT+CHB (Japanese in Tokyo, Japan and Chinese from Beijing, China).We use this example as a guideline to use rehh2.0.We further show how rehh was improved since the previous version, and how it compares to the alternative programs selscan (Szpiech & Hernandez, 2014) and hapbin (Maclean et al , 2015).

Within population tests
The allele-specific extended haplotype homozygosity: EHH (Sabeti et al , 2002) At a focal SNP and for a given core allele (the ancestral or derived), the allele-specific extended haplotype homozygosity (EHH) is defined as the probability that two randomly chosen chromosomes (carrying the core allele considered) are identical by descent (IBD).IBD is assayed by computing homozygosity at all SNPs within an interval surrounding the core region (Sabeti et al , 2002).The EHH thus aims at measuring to which extent an extended haplotype is transmitted without recombination.In practice, the EHH (EHH as,t ) of a tested core allele a s (a s = 1 or a s = 2) for a focal SNP s over the chromosome interval comprised between the core allele a s and the SNP t is computed as: EHH as,t = 1 n as (n as − 1) peer-reviewed) is the author/funder.All rights reserved.No reuse allowed without permission.
The copyright holder for this preprint (which was not .http://dx.doi.org/10.1101/067629doi: bioRxiv preprint first posted online Aug. 3, 2016; where K as,t represents the number of different extended haplotypes (from SNP s to SNP t) carrying the core allele a s , n k is the number of the kth haplotype, and n as represents the number of haplotypes carrying the core allele a s , i.e., n as = Ka s ,t k=1 n k .
The integrated (allele-specific) EHH: iHH (Voight et al , 2006) By definition, irrespective of the allele considered, EHH starts at 1, and decays monotonically to 0 as one moves away from the focal SNP.For a given core allele, the integrated EHH (iHH) (Voight et al , 2006) is defined as the area under the EHH curve with respect to map position.In rehh (Gautier & Vitalis, 2012), this definite integral is computed using the trapezoidal rule.In practice, the integral is only computed for the regions of the curve above an arbitrarily small EHH value (e.g., EHH>0.05).In their seminal paper, Voight et al (2006) considered genetic distances and apply a penalty (proportional to physical distances) for successive SNPs separated by more than 20 kb.In addition, they did not compute iHH if any physical distance between a pair of neighboring SNPs was above 200 kb.We did not implement such an approach in rehh although this might easily be done by modifying the positions of the markers in SNP information input file.In addition, large gaps between successive SNPs (e.g., centromeres) might also be treated by splitting the chromosomes.For instance, when analyzing metacentric chromosomomes (e.g., HSA2), each chromosome arm may be considered separately in the analyses by assigning a different chromosome name (e.g., 2a and 2b) to the underlying SNPs.
The standardized ratio of core alleles iHH: iHS (Voight et al , 2006) Let UniHS represent the log-ratio of the iHH for its ancestral (iHH a ) and derived (iHH d ) alleles: The iHS of a given focal SNP s (iHS(s)) is then defined following (Voight et al , 2006) as: where µ ps UniHS and σ ps UniHS represent, respectively, the average and the standard deviation of the UniHS computed over all the SNPs with a derived allele frequency p s similar to that of the core SNP s.In practice, the derived allele frequencies are generally binned so that each bin is large enough (e.g., >10 peer-reviewed) is the author/funder.All rights reserved.No reuse allowed without permission.
The copyright holder for this preprint (which was not SNPs) to obtain reliable estimates of µ ps UniHS and σ ps UniHS .The iHS is constructed to have an approximately standard Gaussian distribution and to be comparable across SNPs regardless of their underlying allele frequencies.Hence, one may further transform iHS into p iHS (Gautier & Naves, 2011): where Φ (x) represents the Gaussian cumulative distribution function.Assuming most of the genotyped SNPs behave neutrally (i.e., that the genome-wide empirical iHS distribution is a fair approximation of the neutral distribution), p iHS may thus be interpreted as a two-sided p-value (in a − log 10 scale) associated with the null hypothesis of selective neutrality.

Pairwise-population tests
The site-specific extended haplotype homozygosity: EHHS (Tang et al , 2007;Sabeti et al , 2007) At a focal SNP, the site-specific extended haplotype homozygosity (EHHS) is defined as the probability that two randomly chosen chromosomes are IBD at all SNPs within an interval surrounding the core region.EHHS might roughly be viewed as linear combination of the EHH's for the two alternative alleles, with some weights depending on the corresponding allele frequencies.Two different EHHS estimators further referred to as EHHS Sabeti and EHHS Tang have been proposed by Sabeti et al (2007) and Tang et al (2007), respectively.For a focal SNP s over a chromosome interval extending to SNP t, these are computed as (using the same notation as above): where where: as is an estimator of the focal SNP heterozygosity peer-reviewed) is the author/funder.All rights reserved.No reuse allowed without permission.
The copyright holder for this preprint (which was not .http://dx.doi.org/10.1101/067629doi: bioRxiv preprint first posted online Aug. 3, 2016; k is an estimator of haplotype heterozygosity over the chromosome region interval extending from SNP s to SNP t.

The integrated EHHS: iES
As for the EHH (see above), EHHS starts at 1 and decays monotonically to 0 with increasing distance from the focal SNP.At a focal SNP, and in a similar fashion as the iHH, iES is defined as the integrated EHHS (Tang et al , 2007).Depending on the EHHS estimator considered, EHHS Sabeti or EHHS Tang , two different iES estimators, that we further refer to as iES Sabeti and iES Tang can be computed.
The standardized ratios of pairwise population iES: XP-EHH (Sabeti et al , 2007) and Rsb (Tang et al , 2007) For a given SNP s, let LRiES Sabeti (s) (respectively LRiES Tang (s)) represent the (unstandardized) log-ratio of the iES

Material and Methods
A new efficient algorithm to explore haplotype variability In the previous version of rehh (1.13) the distribution of haplotype counts for the entire interval from the core SNP to the distance x was computed for each x independently, entailing repeatedly the same calculations.In the new version of rehh (2.0), the distribution of haplotype counts for the interval from the core SNP to the distance x is updated consecutively from the distribution of haplotype counts corresponding to the interval between the core SNP and x − 1.The new algorithm doesn't affect the output, in particular, as in the previous version, all haplotypes carrying missing data are discarded from the computation of long-range homozygosity.

Human haplotype data
Two HSA2 haplotype data sets were downloaded from the HAPMAP project (phase III)

Computation
For comparison purposes, the different haplotype data sets were analyzed using the software packages rehh (both the previous version 1.13 and the new version 2.0), selscan (version 1.1.0b)(Szpiech & Hernandez, 2014) and hapbin (version 1.0.0)(Maclean et al , 2015).Default options were generally used except for the minimal threshold on the minor allele frequency (MAF) that was set to 0.01 for all programs.In addition, for the selscan program, both the window size around the core SNPs (--ehh-win option) and the maximum allowed gap in bp between two consecutive SNPs (--max-gap option) were set to 10 9 (this was made to disallow these options that are not considered in other programs).Similarly, the --max-extent option was inactivated by setting --max-extent=-1.For the hapbin programs (i.e., ihsbin and xpehhbin), the EHH and EHHS cut-off values (defined to stop the calculation of unstandardized iHS and iES) were set to 0.05 (i.e., the default value in selscan and rehh).For all programs, the standardization of iHS was performed with allele frequency bins of 0.01, as controlled by the freqbins Finally, for each analysis and parameter set, the estimation of computation times was averaged over ten independent runs.All analyses were run on a standard computer running under Linux Debian 8.5 and equipped with an Intel R Xeon R 6-core processor W3690 (3.46 GHz, 12M cache).Note that the Unix command taskset was used to control the number of working threads for the analyses with the hapbin programs (since neither the ihsbin nor the xpehhbin programs allow to chose the number of threads to be used).

Results and Discussion
Analysis of the human chromosome 2 data sets (XP-EHH=5.59).For this latter SNP (mapping to region #7 as defined below), the haplotype bifurcation diagrams for the ancestral and derived alleles within the CEU population are plotted in (Figure 1C) and (Figure 1D), respectively, using the bifurcation.diagramfunction from the rehh package.Note the extent of haplotype homozygosity associated with the derived allele (Figure 1C), relatively to that associated with the ancestral allele (Figure 1D), which is consistent with the negative iHS measure at this SNP (iHS=-3.24).
[Figure 1 about here.] To further identify regions displaying strong footprints of selection, we split the HSA2 chromosome into 950 consecutive 500 kb-windows (with a 250 kb overlap).Windows with at least 2 SNPs displaying a statistic > 4 (in absolute value that roughly corresponds to a two-sided p-value< 10 −4 , see above) for at least one of the four test statistics were deemed significant.Significant overlapping windows were then merged, leading to a total of 11 regions harboring strong signals of selection, which characteristics are detailed in Table 1 (see also Figure 1).As expected, most of the regions identified in previously published genome-scans for samples with the same origin (Sabeti et al , 2007;Tang et al , 2007;Voight et al , 2006) overlap with the regions identified here (Table 1).For instance, regions #6 and #7 that are in the vicinity of the EDAR gene (under selection in Asian populations) and the LCT gene (under selection in European populations), respectively, have been extensively characterized in the literature (e.g., Peter et al , 2012).
Interestingly, we detected more regions than previously reported in the aforementioned studies, most probably because our analyses are based on a larger dataset and different cut-off values.A more detailed description of the newly identified regions is however beyond the scope of the present article.
[Table 1 about here.] Note finally that XP-EHH-and Rsb-based scans gave consistent results, with the exception of the region in the vicinity of the LCT gene (#7 in Table 1 and Figure 1) where a double peak was observed with Rsb (consistent with the iHS profile within CEU) and a single peak with XP-EHH.The overall correlation between these statistics was equal to 0.843, which illustrates the close similarity of these two metrics.
Comparing the performances of rehh 2.0 relatively to rehh 1.13, selscan and hapbin packages The two CEU and JPT+CHB human data sets were further analyzed with rehh 1.13 to evaluate the gain in computation time resulting from the modifications introduced in version 2.0.Note that extensive tests were done during the development of version 2.0, to ensure that the same estimates were obtained with both versions.Only very marginal differences were however sometimes observed in the estimates of iES Tang .For instance, the correlation between the resulting Rsb computed across the CEU and JPT+CHB populations with version rehh 1.13 and rehh 2.0 was found equal to 0.999992 (instead of 1.0).This is actually due to the introduction of the computation of iES Sabeti in version 2.0 to estimate XP-EHH.Indeed, we chose to define the same cut-off value for both statistics during the computation of the component variable EHHS (controlled with the option limehhs, set to 0.05 by default).
peer-reviewed) is the author/funder.All rights reserved.No reuse allowed without permission.
The copyright holder for this preprint (which was not .http://dx.doi.org/10.1101/067629doi: bioRxiv preprint first posted online Aug. 3, 2016; version 1.13.Although a more detailed profiling of the algorithm would be required, these results suggest that computational burden is approximately linearly related to the data set complexity. To further improve computational speed, haplotype structure is now performed using OpenMP parallelization across SNPs in genome-wide scans.Using four threads then lead to an additional decrease of about 3.5 times in computation times (see Table 3).
Overall, the whole analysis of the HSA2 haplotype files used in this study took about 1.5 minutes (including the processing of input files) with rehh 2.0, and more than 1.3 hours with rehh 1.3.This corresponds to the computation of iHS within the CEU and within the JPT+CHB populations, as well as the computation of Rsb and XP-EHH.

about here.]
Comparing rehh 2.0 to the selscan and hapbin programs Finally, we compared rehh 2.0 with selscan (Szpiech & Hernandez, 2014) and hapbin (Maclean et al , 2015), which were recently published.Both programs are written in C++ language and include parallelisation.Computation times for the different analyses, either on a single or four threads, are provided in Table 3.The new version of rehh outperforms selscan by about one order of magnitude.Moreover, running rehh on a single thread is still more than twice as fast as running selscan on four threads.It should also be noticed that running a full analysis consisting of the estimation of iHS within and XP-EHH between the CEU and JPT+CHB populations result in a significant additional burden with selscan (Table 3).Conversely, hapbin was found to be more than five times faster than rehh 2.0, most likely as a result of its more efficient algorithm to explore haplotype variability.Yet, given the small computation times achieved by both programs, rehh 2.0 remains competitive relative to hapbin for most practical applications.
Correlation between the estimated iHS and XP-EHH obtained with the different programs are given in Table 4. Estimates of XP-EHH were in almost perfect agreement among the different software packages.
Similarly, estimates for iHS were almost the same between rehh 2.0 and selscan but slightly depart from those obtained with hapbin.Although we did not further investigate the origin of these discrepancies, this might probably be related to a different definition of haplotype homozygosity in hapbin, as compared to Sabeti et al (2007).
peer-reviewed) is the author/funder.All rights reserved.No reuse allowed without permission.

Conclusion
Although the R package rehh (Gautier & Vitalis, 2012) has been widely used since its first release, the increasing dimension of haplotype datasets typically available in most species led to serious limitations.This stimulated the development of alternative R-free solutions (Szpiech & Hernandez, 2014;Maclean et al , 2015).In this study, we introduced substantial changes in the rehh package to improve its computational efficiency by one to several orders of magnitude.This was achieved by modifying the processing of the input files and, most importantly, by improving and parallelizing the core algorithm that computes the distribution of haplotype counts.As a result, rehh 2.0 clearly outperforms the selscan (Szpiech & Hernandez, 2014) package and competes with hapbin (Maclean et al , 2015), the fastest program to date.A decisive advantage of rehh 2.0 over these programs is that it allows working within the multi-platform R environment.As such, it benefits from several graphical tools that facilitate visual interpretation of the results.rehh 2.0 is available from the CRAN repository (http://cran.r-project.org/web/packages/rehh/index.html).A help file together with a detailed vignette manual (the current version is provided as a Supplementary File S2) are included in the package.In A) and B), the horizontal dotted lines indicate the |iHS| significance threshold of 4 that was used to identify significant regions (see Table 1) and the arrows at the top of the graph indicate the mid-position of the significant regions described in Table 1).C) and D) Haplotype bifurcation diagrams drawn for the ancestral and derived allele, respectively, of the rs7377606 SNP in the CEU population (XP-EHH peak position of region #7 described in a As mentioned in the manual, rehh version 1.x is quite inefficient in allele recoding.Versions 1.x are also prone to error (e.g., if some alleles are coded as "T").

List of Figures
b using the new option haplotype.in.columns=T or p rSB : p xpEHH = − log 10 (1 − 2|Φ (xpEHH) − 0.5|) and p rSB = − log 10 (1 − 2|Φ (rSB) − 0.5|) (9) where Φ (x) represents the Gaussian cumulative distribution function.Assuming most of the genotyped SNPs behave neutrally (i.e., the genome-wide empirical distributions of XP-EHH and Rsb are fair approximations of their corresponding neutral distributions), p xpEHH and p rSB may thus be interpreted as a two-sided p-values (in a − log 10 scale) associated with a null hypothesis of selective neutrality.Alternatively, one may also compute p xpEHH or p rSB as: p xpEHH = − log 10 (Φ (xpEHH)) and p rSB = − log 10 (|Φ (rSB) |) (10) (see Gautier & Naves, 2011); p xpEHH and p rSB may then be interpreted as a one-sided p-values (in a − log 10 scale) allowing the identification of those sites displaying outstandingly high EHHS in population pop2 (represented in the denominator of the corresponding LRiES) relatively to the reference population (pop1).
argument in the ihh2ihs() function of the rehh package, and the bins argument for the program norm of the selscan package and the program ihsbin of the hapbin package.The command lines used for the different programs, together with the corresponding input data files are provided in the Supplementary Materials.

For
illustration purpose, we used rehh 2.0 to analyze two human data sets consisting of 236 and 342 haplotypes of 110,200 SNPs mapping to HSA2 that were sampled in the CEU and JPT+CHB populations respectively.The chromosome-wide scans of iHS for the CEU and the JPT+CHB populations, respectively, are plotted in Figure 1A.The most significant SNP map at position 136,503,121 bp for the CEU population (iHS=-5.35)and at position 111,506,728 bp for the JPT+CHB population (iHS=-4.92).The chromosome-wide scans of XP-EHH and Rsb, which contrast EHH profiles between the CEU and the JPT+CHB populations, are plotted in Figure 1B.The most significant SNP mapped at position 136,533,558 bp for Rsb-based test (Rsb=6.13)and at position 136,523,244 bp for the XP-EHH-based test

FIGURESFigure 1 .
FIGURES Tang et al (2007)tively med LRiES Tang ) and σ LRiES Sab (respectively σ LRiES Tang ) represent the median and standard deviation of the LRiES Sabeti (s) (respectively LRiES Tang (s)) computed over all the analyzed SNPs.As recommended byTang et al (2007), the median is used instead of the mean because it is less sensitive to extreme data points.As for the iHS (see above), XP-EHH and Rsb are constructed to have an approximately standard Gaussian distribution.They may further be transformed into p xpEHH peer-reviewed) is the author/funder.All rights reserved.No reuse allowed without permission.
(The International HapMap3 Consortium, 2010)website (ftp://ftp.ncbi.nlm.nih.gov/hapmap/).They consisted of 236 haplotypes of 116,430 SNPs from the CEU and 342 haplotypes from the JPT+CHB populations, respectively.Further details about these data (including the phasing procedure) can be found on the HAPMAP website.For each SNP, the ancestral (and derived) allele was determined according to the Chimpanzee genome reference (using the dbsnp chimp B36.gff annotation file available at ftp://ftp.ncbi.nlm.nih.gov/hapmap/gbrowse/2010-08_phaseII+III/gff/).Such ancestral information is indeed required to carry out iHS-based tests (see above).As a result, 6,230 SNPs (5.35%) for which ancestral/derived states could not be unambiguously determined were discarded from further analyses leading to a total of 110,200 SNPs per analyzed haplotype.
about here.] Plot of XP-EHH (|XP-EHH| in blue) and Rsb (− |Rsb| in red) between the CEU and JPT+CHB populations.In A) and B), the horizontal dotted lines indicate the |iHS| significance threshold of 4 that was used to identify significant regions (see Table1)and the arrows at the top of the graph indicate the mid-position of the significant regions described in Table1).C) and D) Haplotype bifurcation diagrams drawn for the ancestral and derived allele, respectively, of the rs7377606 SNP in the CEU population (XP-EHH peak position of region #7 described in Table1and containing the LCT gene).In C) and D), the two grey vertical dotted lines delimit the LCT gene. . . . . . . . . . . . . . . . . .19 peer-reviewed) is the author/funder.All rights reserved.No reuse allowed without permission.

Table 2 .
Comparison of the computation times (in seconds) required to process input data files with the data2haplohh function for the versions 1.13 and 2.0 of the rehh package.Two data sets consisting, respectively, of 236 and 342 haplotypes of 110,200 SNPs for the CEU and JPT+CHB populations were considered (see the main text).For each of these datasets, the table gives the average computation times (± standard deviation) across ten independent runs, either with or without (in parentheses) allele recoding (using the option allele.recode).In rehh the function scanhh computes iHH and iES simultaneously.It therefore needs to be run only once per haplotype data set.As a result, computing XP-EHH (and/or Rsb) requires almost no extra time, once iHS for the two populations has been computed. a

Table 3 .
Comparison of the time (in seconds) required to compute the different EHH-based statistics for the versions 1.13 and 2.0 of the rehh, the selscan and the hapbin packages.For each analysis, the table gives the average computation time (± standard deviation) across ten independent runs.For each program, analyses were run either on a single thread or on four threads (except for rehh 1.13 version, which is not parallelised)

Table 4 .
Correlation between the estimated iHS and XP-EHH statistics across the programs rehh (version 2.0), selscan and hapbin.Pairwise correlation for the iHS computed in the CEU and the JPT+CHB (in parenthesis) populations are given in the upper diagonal.Pairwise correlation for the XP-EHH computed across the CEU and JPT+CHB populations are given in the lower diagonal.