Optimization of sampling designs for pedigrees and association studies

In many studies, related individuals are phenotyped in order to infer how their genotype contributes to their phenotype, through the estimation of parameters such as breeding values or locus effects. When it is not possible to phenotype all the individuals, it is important to properly sample the population to improve the precision of the statistical analysis. This article studies how to optimize such sampling designs for pedigrees and association studies. Two sampling methods are developed, stratified sampling and D optimality. It is found that it is important to take account of mutation when sampling pedigrees with many generations: as the size of mutation effects increases, optimized designs sample more individuals in late generations. Optimized designs for association studies tend to improve the joint estimation of breeding values and locus effects, all the more as sample size is low and the genetic architecture of the trait is simple. When the trait is determined by few loci, they are reminiscent of classical experimental designs for regression models and tend to select homozygous individuals. When the trait is determined by many loci, locus effects may be difficult to estimate, even if an optimized design is used.


INTRODUCTION
Quantitative genetics studies the inheritance of quantitative traits such as height or weight, which vary continuously. Experimental studies in quantitative genetics use phenotypic and genetic data to address various fundamental or applied issues such as the selection of plants or animals in breeding programs (Jannink et al., 2010), the identification of individuals at increased risk of disease (Vazquez et al., 2012), the evaluation of genetic resources stored in gene banks (Crossa et al., 2016), the inference of the genetic architecture of traits in association studies (Barton et al., 2007, Chap. 14;Rau et al., 2019), the study of the genetic mechanisms underlying the response to selection (Durand et al., 2010(Durand et al., , 2015 or the inference of the strength of evolutionary forces such as drift, mutation or selection (Gillespie, 1998, Chap. 5;Barton et al., 2007, Chap. 19;David et al., 2020). Such studies require a sampling step and the choice of a sampling strategy when it is not possible to phenotype all the individuals of the studied population (Cochran, 1977).
Several studies have investigated how to optimize sampling designs for genomic prediction (Isidro et al., 2015;Rincent et al., 2017;Akdemir and Isidro-Sánchez, 2019). In such experiments, the population is genotyped with genome-wide markers. The individuals belonging to a candidate set are available for phenotyping, while the other individuals cannot be phenotyped because they may be dead, for example. Since phenotyping is costly, some individuals are sampled in the candidate set and are phenotyped. The genetic and phenotypic data are then used to predict the breeding values of target individuals. Designing such surveys consists of choosing the individuals to be phenotyped in the candidate set in order to predict the target breeding values as precisely as possible.
Sampling designs for genomic prediction have been generated using stratified sampling (SS) (Isidro et al., 2015). This method first divides the population into homogeneous subpopulations by clustering the genetic data. Individuals are then sampled in each subpopulation. This method accounts for the population structure and ensures a high genetic diversity in the sample. Alternatively, sampling designs for genomic prediction have been optimized using optimal design theory (Atkinson and Donev, 1992;Chaloner and Verdinelli, 1995;Isidro et al., 2015;Rincent et al., 2017;Akdemir and Isidro-Sánchez, 2019). In this approach, a criterion that quantifies the merit of a design is first defined, for example, a criterion that quantifies the variance of the estimators of interest. Efficient designs are then generated by optimizing this criterion over an appropriate class of designs. Design criteria may be optimized using combinatorial optimization methods such as genetic algorithms or simulated annealing (Reeves, 1993;David et al., 2000;Van Groenigen, 2000;Müller et al., 2004;Amzal et al., 2006;Brus and Heuvelink, 2007;Butler et al., 2014;Del Moral and Doucet, 2014;Akdemir and Isidro-Sánchez, 2019). In these studies, optimized designs generally performed better than simple random sampling (SRS) and improved the precision of genomic prediction.
Quantitative genetics includes other types of experiments than genomic prediction, for example, surveys involving pedigrees or association studies. It is not clear how to sample pedigrees, if more individuals should be sampled in early or in late generations, and how mating system and mutation influence sampling designs. The experimental setup of an association study is similar to that of genomic prediction, however, their objective is to estimate the effects of loci on traits rather than breeding values. Likewise, it is not clear how to sample individuals in association studies and how the genetic architecture of the trait influences sampling designs.
This article develops sampling methods for pedigrees and association studies. Observations are assumed to follow a classical infinitesimal model, in which the phenotype can be partitioned into independent genetic and environmental components (Section 2). Two sampling methods are developed, SS and optimality ( O) (Section 3). A real maize experiment is used to study how to sample pedigrees, and the influence of mating system and mutation on sampling (Section 4). Finally, simulations are carried out to study how to sample individuals in association studies, and the influence of sample size and the genetic architecture of the trait on sampling (Section 5).

MODELS
We consider a population of diploid individuals. This population contains a candidate set , whose individuals are available for phenotyping for some quantitative trait. A sample of < individuals to be phenotyped is chosen in . Let be the × incidence matrix whose entry in row and column satisfies = 1 if the th observation is collected on individual , and = 0 otherwise. For example, if the second observation is collected on individual 69, then 2 69 = 1 and 2 = 0 for ≠ 69.
Trait values are assumed to follow an infinitesimal model (Wray, 1990;Gillespie, 1998;Lynch and Walsh, 1998;Barton et al., 2017). They are assumed to be the sum of a genetic component, referred to as the breeding value, and an environmental component. They are observed with measurement errors: for 1 ≤ ≤ , the th observation is where is the grand mean, is the breeding value of individual , is an environmental effect, and is a measurement error. The residuals = + are assumed to be independent and normally distributed with mean 0 and variance 2 , that is, ∼  (0, 2 ). They are assumed to be independent of breeding values, that is, there are no genotype-environment interactions.

Modeling breeding values using the pedigree
The survey is assumed to involve some genetic data in addition to the phenotypic data. These data may consist of a pedigree dataset, that indicates the parents and the date of birth of each individual. Without loss of generality, dates are expressed here in generations, like in Wray (1990). This is more appropriate for populations with discrete generations, but any other time unit such as day or week could be used instead. The only restriction is that parents must appear in earlier time units than their offspring. The pedigree is assumed to have + 1 generations, denoted by 0, … , .
In order to study the sensitivity of designs to mutation, breeding values are modeled taking account of mutation. The mutations arising at generation 0 are not taken into account since they are considered as being part of the alleles present at this generation. New mutations are assumed to arise independently in the individuals of generation 1 and of subsequent generations. The alleles present at generation 0 and new mutations are assumed to have additive effects on breeding values (Wray, 1990) , where 0 is the component of the breeding value of individual that is the result of alleles inherited from ancestors born in generation 0, and for ≥ 1, is the component that is the result of mutations arising in generation . The vector 0 of genetic effects inherited from generation 0 is assumed to be normally distributed (Wray, 1990;Lynch and Walsh, 1998;Barton et al., 2017): where 2 0 is the additive genetic variance, and 0 is an × kinship (or relationship, similarity) matrix. This matrix can be computed in various ways. For the sake of simplicity, it is assumed here that 0 is under the control of a single locus with many alleles and that it is the sum of the contributions from the two alleles carried by individual , that is, there is no dominance (Gillespie, 1998, Chap. 5). Alleles are transmitted from parents to offspring by Mendelian inheritance. The entry 0 ′ is then equal to twice the coefficient of kinship between individuals and ′ (Wright, 1922;Gillespie, 1998, Chap. 5). This coefficient is the probability that two alleles, one sampled from individual and one sampled from individual ′ independently, are identical by descent. It is also assumed here that the individuals of generation 0 are not related and are not inbred, but this assumption can easily be relaxed if information on their relatedness and their inbreeding is available. For ≥ 1, the vector of mutation effects is assumed to be normally distributed (Wray, 1990): where 2 is the mutational variance, and is an × kinship matrix. Since is the result of mutations arising at generation , = 0 if individual was born before generation , with the convention that all the elements of involving this individual are equal to 0, that is, Since new mutations are transmitted from parents to offspring by Mendelian inheritance and are assumed to have additive effects, the elements of involving individuals born after generation − 1 are equal to twice their coefficient of kinship. Since mutations are assumed to arise independently in gametes, those arising at generation do not cause any inbreeding in the individuals born at generation .

Modeling breeding values using genotyping data
Alternatively, the genetic data may consist of a genotyping dataset, giving the genotype of all the individuals at biallelic loci. Let be the × matrix that specifies which alleles each individual carries at each locus. Specifically, the genotype of individual at locus is set to 1 (respectively, -1) if individual carries two copies of allele 1 (respectively, 2) at locus , and to 0 if individual is heterozygous at locus (VanRaden, 2008). The columns of are often standardized (VanRaden, 2008;Rincent et al., 2017;Gianola et al., 2020). As explained in the next paragraph, they are divided here by the square root of the average diagonal element of , where denotes transposition, yielding the matrix = ∕ √ , where = trace( )∕ . Alleles are assumed to have additive effects on breeding values, that is, there is no dominance and no epistasis: where is a parameter that quantifies the effect of locus . As increases, the genetic architecture of the trait is more complex. The vector of locus effects is assumed to be normally distributed (Gianola et al., 2020): where 2 is the additive genetic variance and is the identity matrix of size . Thus, the vector of breeding values is normally distributed with mean 0 and conditional variance matrix Var( | 2 , 2 ) = 2 , where = 2 and the variance component 2 is equal to 2 = 2 ∕ 2 . Thanks to the scaling of the columns of mentioned above, the prior variance of breeding values is approximately equal to 2 2 , which facilitates the specification of the prior distribution of 2 . The columns of are not centered here so that the grand mean represents the average trait value of a heterozygous individual at all the loci.

Prior distributions
This article uses a Bayesian framework. A vague normal prior distribution is placed on the intercept : | 2 ∼  (˜, 2˜) , where˜is a known constant and˜→ +∞.
The residual variance is assumed to follow a prior distribution with mean˜2. Variance components are assumed to follow prior distributions with means˜2 0 ,˜2 , and˜2.

SAMPLING METHODS
The objective of the sample survey is to estimate the breeding values of the individuals belonging to some target set  , or to estimate locus effects. Without loss of generality, the population is assumed to be the union of  and  . Choosing the design of the survey consists of choosing the sample, that is, individuals in  to be phenotyped.

Stratified sampling
The population may be sampled using SS. First, the individuals of the candidate set are divided into clusters using a hierarchical clustering. When the genetic data are pedigree data, the dissimilarity ′ between individuals and ′ is equal to ′ = √ ′ ′ − ′ , where = (˜2 0 ,˜2 ). When the genetic data are genotyping data, it is the Euclidian distance calculated from (Isidro et al., 2015). The design is then generated by sampling one individual in each cluster independently and uniformly.

criteria
Sampling designs may be generated by maximizing the criterion for breeding values (Atkinson and Donev, 1992, Chap. 6;Chaloner and Verdinelli, 1995;Verdinelli, 2000). This criterion quantifies estimation errors. When breeding values are estimated by their posterior mean, the matrix of squared errors is equal to Since the true breeding values are unknown and since observations are not available when the survey is designed, squared errors are averaged over the possible values of these variables, yielding the matrix where ( , ) is the joint density of and . This matrix is equal to the average posterior variance matrix of breeding values: The criterion for breeding values is based on the determinant of this matrix: When variance components have low prior variance, it is easier to compute, since it is approximately equal to (online Appendix A) is the square matrix of ones of size , and is equal to (˜2 0 ,˜2 ) or to (˜2) depending on the experimental setup. This expression assumes that the prior variance matrix of breeding values is invertible, which is often the case. It may be singular when, for example, two individuals are identical twins (or have the same genotype at all the loci), when the individuals born at generation 0 have been selected, or when < (Henderson, 1985(Henderson, , 1986VanRaden, 2008;Fernando et al., 2016).
Alternatively, sampling designs may be generated by maximizing the criterion for locus effects, which is equal to When the prior variance of 2 is low, this criterion is approximately equal to (online Appendix A) ≈ ln{det( −2 + )}.
When variance components have low prior variance, maximizing or is equivalent to maximizing the following version of the criterion (online Appendix A): = ln{det( + )}.
Thus, the criteria for estimating all of the breeding values or all of the locus effects can be optimized simultaneously.

Genetic algorithm
The criterion can be maximized using the genetic algorithm STPGA Akdemir and Isidro-Sánchez, 2019). This algorithm makes a population of designs evolve so that values increase in the population over iterations. At each iteration, designs with large values are selected in the population and produce offspring that forms the next generation. Offspring may not be identical to their parents because of mutation and recombination. Inefficient designs that were visited recently may be removed from the next generation (Tabu search). Designs generated using regression methods may be introduced in the next generation. This algorithm can be implemented using the function GenAlgForSubsetSelectionNoTest of the R package STPGA (Akdemir, 2018). In this article, each design of the initial population is generated by SS independently (Section 3.1). Maximizing using STPGA is computationally more efficient than maximizing since involves a matrix of dimension × only; it is more efficient than maximizing when < .

Explicit result
When the genetic data are pedigree data and when the prior variances of 2 0 and 2 are low, if the breeding values of the selected individuals are not correlated a priori and have maximal prior variance among , then the design is optimal over the set of all designs given  and (online Appendix B).

Simulation study
A simulation study was carried out to check that STPGA can optimize the criterion. Its R code can be found in Supporting Information. In this study, the population consisted of a theoretical pedigree comprising 80 individuals, which reproduced by outcrossing and which were divided into 10 unrelated families (see online Appendix C for more information). The candidate and target sets both comprised all the individuals. Sample size was either equal to 10 or to 70. Breeding values were modeled using the pedigree. In total, 80 design searches were carried out with different values of variance components in cases when optimal/efficient designs were known. For example, when = 10, sampling the individuals of the last generation was optimal (Section 4.1). STPGA performed rather well since it always generated designs as efficient as the reference designs, except in two cases.

4.3
Sampling a maize pedigree

Experimental setup
The influence of mating system and mutation on sampling was studied using a real maize population. To study the genetic mechanisms underlying the response to selection, a divergent selection experiment for maize flowering time was carried out in France (Durand et al., 2015). It yielded four early and four late subpopulations for flowering time.
For the sake of simplicity, we only considered one early subpopulation and one late subpopulation here, that were derived from a seed lot of the maize inbred line F252 by successively selecting and selfing the earliest and the latest plants at each generation. Each subpopulation had a single and different ancestor. Both subpopulations formed the target set that comprised 163 plants. These plants were dead and could not be phenotyped. However, they had been selfed and the resulting seeds could be sown to produce plants that could be phenotyped. These seeds, one for each plant of the target set, formed the candidate set of plants that could be phenotyped, which comprised 163 plants. The population included a total of = 326 plants that were distributed in + 1 = 17 generations ( Figure 1). It was structured since it comprised the early and late subpopulations. The pedigree of the population and some genotyping data were available, while other genotyping data remained to be collected. Scientists were interested in estimating both breeding values and locus effects. We studied how to sample = 50 plants in the candidate set.
In order to study the influence of mating system, a theoretical pedigree with outcrossing was considered in addition to the real pedigree with selfing. In this scenario, each plant that was not a founder of the pedigree had two parents. The first one was its parent in the original pedigree with selfing. The second one was an additional plant placed in generation 0, which differed between subpopulations. Thus, all the plants of generations 1 to 16 of a given subpopulation had the same second parent (Figure 2).

Methods
Kinship matrices were generated using the pedigree. The prior means of variance components took two possible values:˜2 0 = 0.25 or˜2 0 = 4,˜2 = 0.0001 or˜2 = 0.04. The explicit result of Section 4.1 did not allow the construction of optimal designs in this application. For each combination of values of variance components, a design was generated by maximizing the criterion using STPGA, and 100 designs were generated by SS (see online Appendix D for more information). In addition, 100 designs were generated by SRS.
Sampling methods were compared using the criterion. This criterion assumed that the target set consisted of the whole population, which was not the case in this application. To better assess designs, the maximal and average variance criteria were also computed: where | | was the cardinal of  . These criteria took the target set into account. As they were based on posterior variances, they

Results
Designs were sensitive to genetic variances and mating system. As the mutational variance increased, SS and O sampled more plants in late generations (Figures 1 and 2 and online Figures D.1-D.4). In addition, this effect was stronger when plants reproduced by outcrossing or when the prior mean of the additive genetic variance compo-nent˜2 0 was low. For example, when plants reproduced by selfing and˜2 0 = 0.25, O sampled more plants in early (respectively, late) generations when˜2 = 0.0001 (respectively,˜2 = 0.04) (Figure 1). SS and O tended to select individuals whose breeding values were weakly correlated a priori and had large prior variances (online Tables D.1

Explicit result
When the genetic data are genotyping data, when the prior variance of 2 is low, and when ≤ , if at each locus, half of the selected individuals has genotype -1 and the other half has genotype 1, and if locus effects are not correlated a posteriori (when > 1), then the design is optimal over the set of all designs given  and (online Appendix B). Locus effects are not correlated a posteriori when the matrix is diagonal, that is, when the genotypes of the selected individuals at different loci are not correlated.

Experimental setup
The influence of sample size and the genetic architecture of the trait on sampling was studied by means of simulations. The R code of this study can be found in Supporting Information. For the sake of simplicity, the experimental setup had a smaller size than real association studies: the population was composed of 16 individuals, genotyped at 18 loci. The candidate and target sets both consisted of the whole population. The population was divided into four subpopulations comprising four individuals. Allele frequencies varied between subpopulations (see online Appendix E for more information). Each locus was assumed to be at Hardy-Weinberg equilibrium in each subpopulation. Genotypes were simulated independently between individuals and between loci, that is, there was no linkage disequilibrium.
Sample size took three possible values: = 4, = 8, or = 12. The trait was assumed to be affected either by = 2 loci or by all the loci ( = 18). The prior mean of the additive genetic variance component˜2 was either equal to 0.25 or to 4.

Methods
For each sample size, 100 designs were generated by SRS. For each value and each sample size, 100 designs were generated by SS. When = 2, and when = 4 or 8, O designs were generated using the explicit result of Section 5.1. For the other parameter values, O designs were generated by optimizing the criterion using STPGA (see online Appendix E for more information). Sampling methods were compared using the criterion. To better quantify estimation errors for breeding values and locus effects, the maximal and average variance criteria ( , , , and ) were also computed for these parameters. For example, the criterion was equal to = 1 ∑ ∫ Var( | ) ( ) . The computation of these criteria assumed that the additive genetic variance component had low prior variance (online Appendix E). Estimation errors when = 4 and˜2 = 4 were also quantified for each sampling method by simulating trait values. The mean squared error of estimates and the mean correlation between estimated and true parameter values were estimated for breeding values and locus effects (online Appendix E).  Table E.1). As sample size increased and the genetic architecture of the trait was more complex, the advantage of optimized designs over SRS diminished. Criterion values were less dispersed for SS than for SRS. The performance of design methods was not very sensitive to˜2 (online Figures E.3 and E.4).
However, there were a few cases when optimized designs were less efficient. When = 18, all the sampling methods showed similar estimation errors for locus effects, with correlations between estimated and true parameters lower than 0.3 ( Figure 6, online Figure E.6 and online Table E.1). In this case, estimation errors were lower for breeding values than for locus effects, with correlations between estimated and true parameters larger than 0.6 (online Table E (Wray, 1990;Durand et al., 2010). Our results show that it is also useful to take account of mutation at the design stage.
In the association study, the additive genetic variance component did not greatly influence sampling designs, in agreement with earlier results . On the contrary, in the maize application, it influenced designs, with low values strengthening the influence of mutation. This is because when breeding values are modeled using the pedigree, variance components determine the relative importance of standing genetic diversity, mutation, and information provided by the data.

Sampling designs for association studies
This article investigated how to sample populations in association studies. Optimized designs tended to improve the joint estimation of breeding values and locus effects, all the more as sample size was low (Akdemir and Isidro-Sánchez, 2019). Both objectives were consistent when all breeding values and all locus effects were of interest, but they may be inconsistent when the survey is aimed at inferring some particular breeding values or locus effects.
When the genetic architecture of the trait was rather simple, optimized designs were reminiscent of classical experimental designs. Since they tended to sample homozygous individuals, they were similar to optimal designs for linear regression (Atkinson and Donev, 1992, Chap. 6). Since they tended to reduce the correlation between genotypes at different loci, they were similar to orthogonal factorial designs (Atkinson and Donev, 1992, Chap. 7;Kobilinsky et al., 2017). This is because our model can be considered as a linear model in which loci play the role of independent variables or factors.
When the genetic architecture of the trait was rather complex, optimized designs were not more efficient than SRS for estimating locus effects, and the correlation between estimated and true locus effects was low. This is why the number of loci was kept not much larger than population size in our association study. On the contrary, optimized designs were more efficient than SRS for estimating breeding values in this case, and estimation errors were lower for breeding values than for locus effects. Locus effects were difficult to estimate because the number of model parameters was larger than sample size. Our results raise the issue of identifying when high-dimensional regression methods provide reliable results (Verzelen, 2012;Giraud, 2014, Chap. 1).

Stratified sampling
This article developed an SS method for sampling related individuals. This method does not require an optimization step. It can be used with both informative or weakly informative prior distributions for variance components when kinship is modeled using the pedigree. In previous SS methods, the population was divided into large subpopulations and several individuals were sampled in each subpopulation (Isidro et al., 2015). The population was divided here into smaller clusters and one individual was sampled in each cluster. This may more effectively capture structures in the population and reduce the efficiency variability of sampling designs. However, efficiency variability may remain high with our method when clusters are large, as, for example, when sample size is much smaller than population size. SS performed rather well in our applications. This may be because populations were structured in these examples (Isidro et al., 2015). SS was not very efficient when the genetic architecture of the trait was simple and when sample size was not low. In this case, SS tended to sample more heterozygous individuals than O to increase the genetic diversity in the sample, which was not optimal. However, including heterozygous individuals in the sample could be useful to check model assumptions, in particular the absence of dominance (Atkinson and Donev, 1992, Chap. 20).