## Abstract

The genomic relationship matrix plays a key role in the analysis of genetic diversity, genomic prediction, and genome-wide association studies. The epistatic genomic relationship matrix is a natural generalization of the classic genomic relationship matrix in the sense that it implicitly models the epistatic effects among all markers. Calculating the exact form of the epistatic relationship matrix requires high computational load, and is hence not feasible when the number of markers is large, or when high-degree of epistasis is in consideration. Currently, many studies use the Hadamard product of the classic genomic relationship matrix as an approximation. However, the quality of the approximation is difficult to investigate in the strict mathematical sense. In this study, we derived iterative formulas for the precise form of the epistatic genomic relationship matrix for arbitrary degree of epistasis including both additive and dominance interactions. The key to our theoretical results is the observation of an interesting link between the elements in the genomic relationship matrix and symmetric polynomials, which motivated the application of the corresponding mathematical theory. Based on the iterative formulas, efficient recursive algorithms were implemented. Compared with the approximation by the Hadamard product, our algorithms provided a complete solution to the problem of calculating the exact epistatic genomic relationship matrix. As an application, we showed that our new algorithms easily relieved the computational burden in a previous study on the approximation behavior of two limit models.

THE genetic relationship between individuals plays an important role in the analysis of genetic data with population structure such as the prediction of breeding values and genome-wide association studies (GWAS). A classic definition of the genetic relationship is the pedigree kinship, which is the probability that a pair of randomly chosen alleles are identical by decent (IBD). The matrix of pedigree kinship among a group of individuals is the so-called numerator relationship matrix, and has been widely applied in best linear unbiased prediction (BLUP) of breeding values (Henderson 1975). A disadvantage of the pedigree kinship is that it does not account for the variation within a certain type of relationship (*e.g.*, between full-sibs) due to the Mendelian sampling and linkage (Hill and Weir 2011). The availability of dense marker genotypes enables the estimation of the realized kinship, defined as the actual proportion of genomes that are IBD between two individuals. Compared with the pedigree kinship, the realized kinship matrix is not only able to increase the accuracy of BLUP (Hayes *et al.* 2009), but also plays a key role in controlling the population structure in GWAS based on linear mixed models (Yu *et al.* 2006; Yang *et al.* 2014).

There are several estimators of the realized kinship matrix based on dense single nucleotide polymorphism (SNP) markers (Wang *et al.* 2017). Among those estimators, the genomic relationship matrix (GRM) estimates the kinship through the empirical correlation between individuals (VanRaden 2008), and is similar to the pedigree-based relationship coefficient of Wright (1922). A nice property of the GRM is that applying this matrix as the covariance matrix of the genetic values in the genomic BLUP (GBLUP) model is statistically equivalent to explicitly modeling marker additive effects in the ridge/random regression BLUP (RR-BLUP) model (Habier *et al.* 2007). Therefore, the GBLUP model with GRM as the covariance matrix has become one of the most popular models for genomic prediction because of its computational efficiency when the number of markers is large. It has also been generalized to include the dominance effects of markers by modeling a dominance genomic relationship matrix (DGRM) in addition to the additive GRM (Su *et al.* 2012; Vitezica *et al.* 2013).

In addition to dominance (intralocus interaction) effects, epistatic (interlocus interaction) effects also make a substantial contribution to the variation of complex traits (Carlborg and Haley 2004; Mackay 2014). Thus, it is natural to generalize the GBLUP model to include also the epistatic effects, for which the corresponding epistatic genomic relationship matrix (EGRM) serves as a central component. The EGRM has been extensively applied in genome-wide prediction exploiting epistatic effects (Xu *et al.* 2014; Jiang and Reif 2015; Zhao *et al.* 2015). In the area of GWAS, EGRM has also been popularly exploited as a control of epistatic background effects in addition to additive background effects (Xu 2013; Jiang *et al.* 2017; Runcie and Crawford 2019; Santantonio *et al.* 2019). Recently, a fast GWAS algorithm for an exhaustive scan of digenic additive-by-additive interaction effects was developed by constructing the test statistic of epistatic effects from the EGRM (Ning *et al.* 2018). Therefore, the application of EGRM has become increasingly common in genetic and genomic studies.

To apply the EGRM, which is defined as the covariance matrix of genomic estimated epistatic genetic values, the question is how to estimate it. An indirect but computationally efficient solution comes from the classic BLUP model with the pedigree-based numerator kinship matrix, for which the problem of calculating the corresponding epistatic kinship matrix has been solved. Namely, the dominance relationship matrix **D** can be derived from the numerator relationship matrix **A** and the epistatic relationship matrix can be generated using Hadamard (element-wise) product of **A** and **D** (Henderson 1985). This result was directly applied in many studies to calculate the EGRM (Su *et al.* 2012; Morota *et al.* 2014; Muñoz *et al.* 2014). That is, given a GRM **G*** _{A}* and a DGRM

**G**

*, the EGRM can be constructed using the Hadamard product of*

_{D}**G**

*and*

_{A}**G**

*. On the other hand, a direct way of calculating the EGRM is through its definition. The epistatic genetic values are expressed as a linear combination of marker interaction effects and the EGRM is calculated by the cross product of the corresponding coefficient matrix (Xu 2013; Xu*

_{D}*et al.*2014; Jiang

*et al.*2017). However, the computational load of this approach is much higher than the Hadamard product construction, especially when the number of markers and/or the degree of epistasis is high (Martini

*et al.*2020).

Therefore, an interesting question is whether the Hadamard product construction is compatible with the direct approach. This problem has been investigated in several studies (Jiang and Reif 2015; Martini *et al.* 2016; Vitezica *et al.* 2017; Martini *et al.* 2020). It turns out that the Hadamard product matrix implicitly models some extra intralocus interaction effects which is difficult to interpret from quantitative genetics point of view (Martini *et al.* 2020), which means that the Hadamard product matrix only provides an approximation of the EGRM.

It is then necessary to study the quality of the approximation. Ideally, we expect that the Hadamard product matrix approaches the EGRM when the number of markers is large. Hopefully, the limit of both matrices are the same (up to a constant scalar, because as covariance matrices they lead to equivalent statistical models) when the number of markers approaches infinity. Unfortunately, this problem is difficult to investigate because it depends on the choice of coding for the markers, and, in some cases, the limit may not exist at all (Martini *et al.* 2020). Only in the special case of degree 2 additive-by-additive interactions with the specific marker coding of VanRaden (2008), it was proved that the difference between the two matrices converges to zero (Jiang and Reif 2015). An alternative criterion was suggested to measure the quality of approximation and it works independent of the choice of marker coding (Martini *et al.* 2020). Nevertheless, good quality measured by this criterion does not mean that the Hadamard product matrix approaches the EGRM in a strict mathematical sense.

So, it is worthwhile to turn back to the direct way of calculating the EGRM. The reason why it is computationally inefficient is that it requires the formation of a large coefficient matrix as the design matrix of marker interaction effects. The size of the matrix grows very fast as the number of markers or the degree of epistasis increases. Thus, an efficient algorithm must be able to avoid the construction of this matrix. A promising strategy is to derive a formula for the EGRM only using the marker coding matrices and the Hadamard product of such matrices. In the case of only additive interactions, such a formula was provided when the degree of epistasis *s* = 2 (Jiang and Reif 2015; Martini *et al.* 2016). However, the task becomes more and more difficult as *s* increases. In Martini *et al.* (2020), formulas were derived for *s* = 3 and 4. But their approach is difficult to be generalized for arbitrary *s*. We are not aware of any studies in which general formulas were provided for any degree of additive and dominance interactions.

In this study, we aimed at a complete solution to the problem of calculating the precise form of the EGRM. We derived iterative formulas for generating the EGRM for arbitrary degree of additive and dominance interactions. Based on the formulas, we implemented efficient algorithms to calculate the EGRM. Using our formulas, we were able to provide a strict mathematical proof for the quality of approximating the EGRM by the Hadamard product with the specific coding of VanRaden (2008). As an application of our new algorithms, we showed that the empirical study of the approximation behavior of two limit models previously considered by Martini *et al.* (2020) can be easily extended from only affording 30 markers to allowing >1000 markers.

## Theory

### The classic BLUP model with nonadditive genetic values

The classic BLUP model including nonadditive genetic values has the following form (Henderson 1985):(1)where **y** is the *m* × 1 vector of phenotypic records, ** β** is the vector of unknown fixed effects,

**X**is the corresponding design matrix. The vector

**g**

*(1 ≤*

_{i}*i*≤

*t*) denotes the

*i*-th type of genetic values, which can be additive, dominance, or epistatic up to any degree.

**Z**is the corresponding design matrix allocating the phenotypic records to each genotype. The dimension of

**Z**is

*m*×

*n*, where

*n*is the number of individuals. When each genotype has only one record (which means

*m*=

*n*), elements in

**g**

*can be ordered in the same way as in*

_{i}**y**. Then in this case

**Z**= I is the

*n*×

*n*identity matrix.

The model assumptions are , for all , and for any *i*.

We introduce the following notations to conveniently indicate the type of genetic value represented by each **g*** _{i}*: let

**g**

*be the additive genetic value,*

_{A}**g**

*be the dominance genetic value, and be the genetic value of*

_{D}*s*times additive by

*t*times dominance epistasis. For example, is the additive-by-additive genetic value, is the additive-by-dominance-by-dominance genetic value. The same subscripts apply to the corresponding covariance matrix

**V**

*.*

_{i}In the BLUP model (1), the covariance matrix of the additive genetic value is **V*** _{A}* =

**A**, where

**A**is the numerator relationship matrix based on the pedigree information (Henderson 1976). For dominance genetic values,

**V**

*=*

_{D}**D**, where

**D**can be derived from the elements of

**A**(Henderson 1985).

For epistatic genetic values, the covariance matrix can be calculated by the Hadamard product of matrices (Cockerham 1954). For example, , , where ° denotes the Hadamard product. Denote by **A*** ^{ok}* the Hadamard product of

*k*copies of

**A**. In general, the covariance matrix for

*s*times additive by

*t*times dominance epistatic genetic values has the following form:

### The genomic BLUP model with nonadditive genetic values

The basic formula of the GBLUP model is the same as (1), but the covariance matrices **V*** _{i}* are GRMs derived from molecular markers across the genome instead of the pedigree-based matrices.

The most direct way of deriving GRMs is though an explicit expression of the genetic values **g*** _{i}* in terms of marker effects. In the additive case, this is the well-known equivalence between the GBLUP and RR-BLUP model (Habier

*et al.*2007). In the following, we briefly review the key part of the derivation because the same idea is used to derive nonadditive GRMs in subsequent subsections.

We assume that(3)where **M*** _{A}* is the

*n*×

*p*marker matrix with a certain additive coding,

*p*is the number of markers and

**is the vector of marker additive effects. We assume Then we haveBy the assumption of model (1), Thus, up to a scalar we can take as the additive GRM. To remind us that this matrix is derived from genomic data, we denote it by**

*a***G**

*, i.e.,(4)Similarly, let*

_{A}**M**

*be an*

_{D}*n*×

*p*matrix of markers with a certain dominance coding, then the corresponding DGRM is given by

Note that there are different choices for the coding of matrices **M*** _{A}* and

**M**

*, which correspond to different parametrizations of the genetic effects (Álvarez-Castro and Carlborg 2007). Here, we do not discuss the choice of codings, but focus on how to calculate the EGRM when certain additive and dominance codings are given.*

_{D}In view of the construction of epistatic relationship matrices for the classic BLUP model (2), a natural idea of calculating the EGRM is taking the Hadamard product of copies of **G*** _{A}* and

**G**

*. Namely,(5)However, it is known that the Hadamard product matrix is only an approximation of the EGRM. For example, in the case of only additive interactions (*

_{D}*i.e.*,

*t*= 0), it was shown that models not only the desired interlocus interactions but also intralocus additive-by-additive interaction effects, which is difficult to interpret (Martini

*et al.*2020).

In the next two subsections, we establish efficient methods to calculate the precise form of the EGRM. In particular, our methods are valid for any marker coding.

### Deriving the epistatic genomic relationship matrix: the case of only additive interactions

In this subsection, we derive the exact form of the EGRM for any degree of additive-by-additive interactions. Throughout the subsection we assume that a certain additive coding of markers has been chosen and **M*** _{A}* is the corresponding coding matrix.

Recall that denotes the vector of epistatic genetic values involving degree *s* additive-by-additive interactions (*s* ≥ 2). We start from an explicit expression of in terms of marker interaction effects.

The explicit expression of is:(6)where **M*** _{A,j}* is the

*j*-th column of the matrix

**M**

*, “○” denotes the element-wise product of vectors, and denotes the additive interaction effects of loci*

_{A}*i*

_{1},

*i*

_{2},… and

*i*.

_{s}Let be the matrix whose columns are comprised of products of *s* distinct columns of the matrix **M*** _{A}*. Namely, the columns of are the vectors for all possible

*i*

_{1},

*i*

_{2},…

*i*such that . Hence, is an dimensional matrix, where is the binomial coefficient. Let

_{s}

*a**be the dimensional vector consisting of all possible Then we can write (6) in the matrix form similar to (3):Now, similar to (4), we know that the corresponding genomic relationship matrix is calculated as(7)Since is an dimensional matrix, the computation complexity of (7) is Nowadays, it is usual to have at least tens of thousands of markers,*

^{s}*i.e.*,

*p*> 10,000. Then the number grows very fast as

*s*increases. Hence, it is not feasible to calculate using (7) when

*s*is large. For instance, a very small number of markers can be afforded for

*s*> 5 (Martini

*et al.*2020).

In the following, we deduce an efficient method to calculate for which we need to define a new matrix as follows:(8)Note that the calculation of this matrix is efficient because the complexity of calculating the Hadamard product of *s* copies of **M*** _{A}* is which is linear with respect to

*s*, and the multiplication of by its transpose has the complexity which is independent of

*s*.

The main theoretical result of this section is the following:

#### Proposition 1:

Given the marker coding matrix **M*** _{A}* and for any positive integer s with s ≥ 2, the epistatic genomic relationship matrix

*can be calculated iteratively as follows:(9)where we make the convention that*

*is the n ×*

*n*matrix of ones in all entries,

*as calculated in (4),*

*is defined as in (8).*

The proof of the proposition is based on an interesting observation that the elements in the matrix and can be viewed as the elementary and power sum symmetric polynomials of certain variables, respectively. Hence, the mathematical theory for symmetric polynomials (Macdonald 1998) can be applied. We postponed the proof to the Appendix, where we also provided necessary preliminaries on the theory of symmetric polynomials for readers’ convenience.

Note that the right hand side of (9) only involves for (1 ≤ *i* < *s*) and for (1 ≤ *i* ≤ *s*). As already mentioned, the matrix is easy to calculate. Thus (9) provides a recursive algorithm for calculating

Based on the iterative formula, it is also possible to derive an explicit formula to calculate directly using (1 ≤ *i* ≤ *s*):

#### Proposition 2:

Given the marker coding matrix **M*** _{A}* and for any positive integer s with

*s ≥*2, the epistatic genomic relationship matrix

*can be explicitly calculated as follows:(10)where*

*is defined as in (8).*

The proof of the proposition is also given in the Appendix. The above formula looks quite complicated as it requires to find out all possible *s*-tuple of non-negative integers *m*_{1}, *m*_{2},…,*m _{s}* such that However, we can expect that both (9) and (10) are much more efficient than (7) for calculating In the

*Results and Discussion*, we tested the computational efficiency of these methods with empirical data.

As examples, we can easily derive an explicit formula of for *s* ≤ 5 using (9) or (10):The above formula for *s* = 2 is well-known (Jiang and Reif 2015; Martini *et al.* 2016). Recently, Martini *et al.* (2020) derived the same formulas for *s* = 3,4 by detailed analysis of permutations (Martini *et al.* 2020). However, it was also mentioned that their approach is not easy to generalize for higher degree of *s*. Thus formulas for *s* ≥ 5 would be difficult to derive without using our general results (9) or (10).

As a final remark in this subsection, we can see from the above formulas (or, more generally, from Proposition 1 or 2) that the leading term of is noting that Thus, the Hadamard product of the additive GRM serves as the leading term of the EGRM, which confirms that it is only an approximation and not the exact form of the EGRM.

### Deriving the epistatic genomic relationship matrix: the general case

In the previous subsection we considered only the additive-by-additive interactions. Now we turn to the general case of considering additive and dominance interactions up to any degree. Throughout the section, we assume that a certain additive and dominance coding of markers have been chosen, and that **M*** _{A}*,

**M**

*are the corresponding coding matrices.*

_{D}Recall that denotes the genetic value of *s* times additive by *t* times dominance epistasis, is the corresponding EGRM to be derived. If *t* = 0, we have This is the case we have discussed in the last subsection. If *s* = 0, we have This corresponds to the similar case of only considering dominance interactions. In fact, all results in the last subsection, especially Propositions 1 and 2, hold true if we replace the subscripts *A* by *D*. Thus, from now on we assume that *s*, *t* ≥ 1. The structure of the following derivation closely parallels the previous subsection.

We start with an explicit expression of in terms of marker interaction effects:(11)where is the effect of interactions among the additive effects of loci and the dominance effects of loci The corresponding coefficient is the element-wise product of the columns of the matrix **M*** _{A}* and the columns of the matrix

**M**

*, namelyand the summation in (11) is taken over all possible and such that(*)Now, let be a matrix whose columns consist of those for all possible and satisfying condition (*). Note that the number of all possible such (*

_{D}*s*+

*t*)-tuples is calculated by the multinomial coefficient:Thus, is an dimensional matrix. Let

*ad**be the dimensional vector consisting of all possible Then we can write (11) in the matrix form:(12)Then the corresponding EGRM is calculated as(13)It is computationally inefficient to directly calculate using (13) because the complexity amounts to and the number grows very fast as*

^{st}*s,t*increase.

To develop an efficient method of calculating we define the following matrix:(14)which requires only linear computational complexity with respect to *s* and *t*: The formation of Hadamard product has the complexity and the multiplication of by its transpose has the complexity Note that as special cases, we have as defined in (8), and

The main result of this subsection is the following:

#### Proposition 3:

Given the additive and dominance marker coding matrices **M**_{A},**M*** _{D}* and for any positive integer

*s,t*with

*s, t ≥*1, the epistatic genomic relationship matrix

*can be iteratively calculated as follows:(15)where we make the convention that*

*is the n ×*

*n*matrix of ones in all entries,

*is defined as in (14).*

The proof of the proposition again relies on the mathematical theory of symmetric polynomials (Macdonald 1998). In particular, it requires a result similar to the Newton’s Formula concerning a special class of polynomials with two sets of variables. In the Appendix we first proved the required mathematical result and then applied it to prove the proposition.

Note that the Equation 9 obtained in the previous subsection is a special case of (15). Namely, taking *t* = 0 in (15), we see that *j* has to be zero, and Thus, (15) degenerates to (9) when *t* = 0. Similarly, if we take *s* = 0, we obtain the degenerated formula when only the dominance interactions are considered.

We can see that the right hand side of (15) only involves for either *k* < *s* or *l* < *t*, and for *i* ≤ *s* and *j* ≤ *s*. As already mentioned, the matrix is easy to calculate. Thus, (15) provides a recursive algorithm for efficiently calculating

It is possible to derive an explicit formula for in terms of similar to (10) using (15). But it is expected to be even more complicated than (10) and it may not be computationally more efficient than the recursive algorithm based on (15). So we decided not to derive the explicit formula in this study.

As examples, we can easily calculate for *s*, *t* ≤ 2 using (15):The formula for **G*** _{AD}* can be easily derived by definition. Using the method in Martini

*et al.*(2020), one should be able to derive formulas for and Besides these easy cases, the formulas are difficult to derive without using our result (15).

In view of the above formulas (or more generally, from Proposition 3), we see that the Hadamard product serves as the leading term of

### The expectation of the epistatic genomic relationship matrices

A natural and interesting question is why the pedigree-based epistatic relationship matrix is given by the Hadamard product of the corresponding additive and dominance relationship matrix while the marker-based EGRM is not. In this subsection, we try to give an explanation by investigating the expectation of the EGRM under certain assumptions.

According to Cockerham (1954), the pedigree-based epistatic relationship matrix is given by the Hadamard product in three situations: (1) relatives in a randomly mating population, (2) relatives in the same generation of a self-fertilizing population and the gene frequencies are one-half, and (3) relatives which are offspring from randomly mated inbred parents. It was also assumed that the genotype frequencies for all relatives are proportional to the marginal frequencies of the loci and the genes recombine independently. Here in this subsection, we assume that the population is in Hardy-Weinberg Equilibrium (HWE), which covers the situation (1) and (3). We also assume that markers are in linkage equilibrium (LE).

With these assumptions, it is known that the expectation of the additive genomic relationship matrix of VanRaden (2008) is the numerator relationship matrix **A** (Habier *et al.* 2007; Gianola *et al.* 2009). More precisely, let **M*** _{A}* be coded as follows: at each locus

*i*, an individual whose number of reference allele is 0, 1, 2 was coded as −2

*p*, 1 − 2

_{i}*p*, 2 − 2

_{i}*p*, respectively, where

_{i}*p*is the frequency of the reference allele. Then, we have where and Note that is the VanRaden G-matrix and

_{i}In the following, we briefly recall the derivation in Habier *et al.* (2007): let *a _{jk}* be the (

*j*,

*k*) entry of

**A**and denote the (

*j*,

*i*) entry of

**M**

*by*

_{A}*m*, where 1 ≤

_{ji}*j*,

*k*≤

*n*and 1 ≤

*i*≤

*p*. Then, the (

*j*,

*k*) entry of

**G**

*is Then, we havewhere, for the second equality, we used the fact which is a result of the coding system, and, for the fourth equality, we need HWE so that*

_{A}Now, we investigate the expectation of the EGRM Recall that and is an matrix whose columns consist of entry-wise product of two distinct columns of **M*** _{A}*. Therefore, the (

*j*,

*k*) entry of isThus, we can derive thatIn the above derivation, the second equality used the fact by the assumption of LE between markers. The third equality is a result of Cockerham (1954), which also requires LE and some other assumptions mentioned at the beginning of the subsection.

Therefore, we have the following result:(16)where .

On the other hand, we can also calculate the expectation of the entries in the Hadamard product which is proportional to only if

To summarize, we have shown that the expectation of the degree 2 additive-by-additive EGRM equals the Hadamard square of the numerator relationship matrix **A** up to a scalar, while the expectation of the Hadamard square is not, unless the variance of each entry in **G*** _{A}* is zero. Tracing back to the original definition of the pedigree and realized kinship (Hill and Weir 2011), we can see that the above result justifies the role of EGRM and hence provides a conceptual explanation on why the EGRM is not equal to the Hadamard product

The generalization of (16) to arbitrary degree is straightforward: using the same argumentation line, we can show that:(17)where In the case involving dominance-by-dominance interactions, Vitezica *et al.* (2013) suggested a coding system for **M*** _{D}* and the corresponding

**G**

*is the marker-based counterpart of the pedigree-based dominance relationship matrix*

_{D}**D**. Thus, similar to the additive-by-additive case, we can prove that the expectation of equals the Hadamard product

**D**

*up to a scalar.*

^{os}In the general case that both additive and dominance interactions are involved, we can prove the following result: the expectation of the EGRM equals the Hadamard product up to a scalar. It can be derived using the same method as before with an additional assumption that the additive and dominance coding in **M*** _{A}* and

**M**

*must be orthogonal, which is the same as in Cockerham (1954). For instance, the coding system called “natural orthogonal interaction (NOIA) model” introduced by Álvarez-Castro and Carlborg (2007) fulfills the request.*

_{D}Last, but not least, we would like to emphasize that, for the pedigree-based epistatic relationship matrix taking the form of Hadamard product, certain assumptions have to be fulfilled (Cockerham 1954). We also need HWE and LE to establish the link between the EGRM and the pedigree-based epistatic relationship matrix. The situation becomes more complicated when these assumptions are violated. For instance, it was suggested that more complex modeling may be needed for non-HWE populations (Smith and Mäki-Tanila 1990; Mao *et al.* 2006). It would be interesting to study the link between the EGRM and the pedigree-based relationship matrix in this situation.

## Materials and Methods

### Implementing the algorithms

Based on Propositions 1 to 3, three algorithms for calculating the EGRM were implemented. In the case of only considering additive epistasis, we implemented the recursive algorithm EGRM-A1 based on (9) and the direct algorithm EGRM-A2 based on (10). In the general case, we implemented the recursive algorithm EGRM-AD based on (15).

The implementation of EGRM-A1 and EGRM-AD is almost straightforward. Given the marker coding matrices **M*** _{A}* and

**M**

*, the matrices for 1 ≤*

_{D}*k*≤

*s*(or for ) can be easily calculated using (8) or (14). Then, one can write functions to calculate or recursively following (9) or (15), respectively.

For EGRM-A2, one needs to find out all possible *s*-tuple of non-negative integers such thatThis is equivalent to the mathematical problem of finding all partitions of the integer *s* (A *partition* of *s* is given by a sequence where and ). Thus, we applied a fast algorithm for generating integer partitions developed by Stojmenović and Zoghbi (1998). Once this is done, we simply follow (10) for the remaining calculations.

For certain coding of markers, the entries in the resulting EGRM can be very large. Thus, in all our algorithms we standardized the EGRM by dividing the average of the diagonal entries as the final output, i.e.,where Tr denotes the trace operator for matrices, which is the sum of all diagonal elements. The standardization was suggested by Vitezica *et al.* (2017) as it standardized the cross product matrix to a variance of 1. The standardization factor was designed as part of the output of our algorithms. So users can easily change it according to their own interest.

All algorithms were implemented using the statistical software R version 3.5.2 (R Core Team 2019), compiled with the open source basic linear algebra subprograms library OpenBLAS (Wang *et al.* 2013).

### Testing the algorithms with empirical data sets

To test the efficiency of our algorithms, we used the genomic data from two public data sets. One is from the CIMMYT’s Global Wheat Program (Crossa *et al.* 2010). It consists of 599 wheat lines genotyped with 1447 Diversity Array Technology (DArT) markers. The other is a mice data set comprised of 1814 individuals genotyped with 10,346 SNP markers (Valdar *et al.* 2006).

To compare the computational efficiency of EGRM-A1 and EGRM-A2, we recorded the computational time of calculating for *s* up to 10 with the above two data sets. To test the algorithm EGRM-AD, we calculated the matrix for *s*, *t* up to 5. All computations were performed on a single desktop computer with Intel Core i5-6200 CPU (2 × 2.3 GHz).

### Data availability

The authors state that all data necessary for confirming the conclusions presented in the article have been published and embedded in the R package BGLR (Pérez and de Los Campos 2014). The R version compiled with OpenBLAS was downloaded from https://github.com/thequackdaddy/R-OpenBLAS/releases. Source codes of R functions implementing the algorithms EGRM-A1, EGRM-A2, and EGRM-AD were provided in File S1. Sample R codes illustrating how to use these functions were provided in File S2. Supplemental material available at figshare: https://doi.org/10.25386/genetics.12962906.

## Results and Discussion

### The efficiency of the algorithms

In the case of only additive epistasis, we provided two algorithms for calculating the EGRM and compared their computational efficiency (Figure 1). The results indicated that the recursive algorithm (EGRM-A1) is more efficient than the direct one (EGRM-A2) when the degree of epistasis *s* <10. For the wheat data set, we additionally tested the computational time for *s* up to 20. Interestingly, we found that EGRM-A2 became more efficient than EGRM-A1 after *s* = 16. When *s* = 20, the EGRM-A1 algorithm took 4323 sec whereas EGRM-A2 only took 981 sec. This can be explained by the computational complexities of the two algorithms. The most time-consuming part in EGRM-A2 is to generate all possible partitions for *s*, which has a complexity of (Stojmenović and Zoghbi 1998), where *P*(*s*) is the number of all partitions for *s* and it grows as an exponential function of (Andrews 1998). The recursive algorithm has a complexity of due to its iterative manner. So it may take more time for EGRM-A2 to generate all possible partitions when *s* is small compared with EGRM-A1. But the iterative process becomes more time-consuming as *s* increases. In practice, there may be very rare cases in which we need to calculate the EGRM for *s* > 10; thus, in the following, we focused on EGRM-A1.

For the wheat data set, which is relatively small (599 individuals and 1447 markers), EGRM-A1 completed the calculation within 5 sec, even when *s* = 10 (Figure 1A). This is a tremendous improvement compared with the method of direct calculation by definition (7), which is only feasible for <50 markers in the case of high-order epistasis for the same data set (Martini *et al.* 2020). For the mice data set, which has three times more individuals and nearly seven times more markers, EGRM-A1 took ∼60 sec when *s* = 10 and <5 sec for *s* < 5 (Figure 1B).

In the general case, the EGRM-AD algorithm completed the calculation of the EGRM within 50 sec when *s* + *t* < 6 for the mice data set (Figure 2). Thus, our algorithms are efficient for calculating the EGRM for moderately high-order epistasis, which should be sufficient for most applications in genetic and genomic studies.

### Application 1: the quality of approximation by the Hadamard product

Although the main purpose of this study was not investigating the quality of approximating the EGRM by the Hadamard product, our iterative Equations (9) and (15) provided tools to study this problem. As an application, we investigate in this subsection the quality of approximation with the specific coding of the VanRaden G-matrix (VanRaden 2008). We provide a strict mathematical proof that the difference between the entries in the Hadamard product matrix and the precise EGRM approaches zero as the number of markers increases, which implies a good approximation quality if both matrices have a limit in which each entry is different from zero (Martini *et al.* 2020). More precisely, the following proposition holds:

#### Proposition 4:

Let **M*** _{A}* be coded as follows: At each locus i, an individual whose number of reference allele is 0,1,2 was coded as ,

*,*

*respectively, where p*

_{i}is the frequency of the reference allele and

*Then, for any positive integer s, we have(18)where*

*is defined as in (7) and the limit applies entry-wise.*

Note that the original coding of VanRaden G-matrix is −2*p _{i}*, (1 − 2

*p*), (2 − 2

_{i}*p*) and the number

_{i}*c*occurs a scaling factor. Here, we directly divided by in the coding so that no scaling factor is needed. This is only to simplify the expression of (18) and its proof.

##### Proof:

Here, we show the main argumentation line of the proof and leave the technical details in the Appendix. The proposition is proved by mathematical induction. For *s* = 1, (18) is trivial. For *s* = 2, the result has been proved in Jiang and Reif (2015). Thus, we assume that (18) holds for (*r* > 2), and only need to prove that it holds for *s* = *r*.

We expand the left hand side of (18) using (9):(19)where we used the property that the Hadamard product is distributive over addition and the fact that

Now, we make the following two claims whose proofs are provided in the Appendix:

**Claim 1.**For any*s*with 1 ≤*s*≤*r*− 1, the entries in are*bounded*,*i.e.*, there exists*N*> 0 such that for all 1 ≤_{s}*j*,*k*≤*n*. In particular, the bound*N*is independent of the number of markers._{s}**Claim 2.**For any*s*≥ 2, the entries in vanish as*p*approaches infinity,*i.e.*,

With the help of the two claims, we can derive that:(20)because the entries in **G*** _{A}* are bounded, andby the inductive hypothesis. Here, we used a general result in the limit theory of sequences: the product of two sequences converges to 0 if one is bounded and the other converges to 0 (Zorich 2016).

We also know that, for any *i* with 2 ≤ *i* ≤ *r*,(21)because the entries in are bounded and vanishes as *p* → ∞.

Using (20) and (21), we can see that the limit of the left hand side of (19) is zero as *p* → ∞, which completes the proof.

Note that our proof depended on the specific coding and scaling factor of VanRaden (2008). Thus, it would be interesting to investigate whether similar results hold for the other coding systems and scaling factors. In the additive case, there are mainly two other commonly used coding systems, namely {−1,0,1} and {0,1,2}. But the choice of scaling factor also plays a role, which hinders a direct generalization of our proof for other coding systems. The problem becomes more complicated when dominance relationship is involved as various coding systems for the dominance effects were suggested and discussed in the literature (Zeng *et al.* 2005; Álvarez-Castro and Carlborg 2007; Vitezica *et al.* 2013). Nevertheless, our iterative formulas for calculating the EGRM are independent of the coding system. Thus, they may still serve as effective tools to tackle the problem in the future.

### Application 2: an empirical study on two limit models

In the last subsection, we presented a theoretical application of our results. Here, we show an application of our efficient algorithms in data analysis. Martini *et al.* (2020) conducted an empirical study on two limit models. More precisely, they studied the quality of approximating two models whose covariance matrices are given by a combination of EGRMs with all possible degrees up to the number of markers. For each model, the correlation between the precise form of the covariance matrix and its approximation was calculated using a data set consisting of 109 individuals and 1148 SNP markers, which is a subset of the maize data published in Crossa *et al.* (2010). Due to the computational restrictions, they were not able to calculate the EGRM for all *s* ≤ *p* when *p* > 30. Here, we took the same data set and revisited the two limit models. Our new algorithms enabled us to easily investigate the complete behavior of approximation with all 1148 markers.

In the following, we briefly describe the two-limit models. For more details we refer to Martini *et al.* (2020). Note that only the additive-by-additive epistasis was considered and the variance component was assumed to be the same for any degree of epistasis.

**Limit Model 1.**In this model, the weight of the EGRM of each degree is the same. Thus, the covariance matrix is the following:(22)

As we expect that can be approximated by , an approximation of (22) is given by:(23)which can be further approximated for large *p* by(24)where the operation of the exponential function is entry-wise, and 1_{n}_{×}* _{n}* denotes the

*n*×

*n*matrix with 1s in all entries.

**Limit Model 2.**In this model, the EGRM of degree*s*is given a weight*s*!. Thus, the covariance matrix is of the following form:(25)

which can be approximated by(26)If we assume that for any *j* and *k*, (26) can be further approximated for large *p* by(27)in which all operations are entry-wise.

For an empirical study on the two limit models, we adopted the coding system used by Martini *et al.* (2020). Namely, the markers were coded as {−1,0,1} in Model 1 and in Model 2. Dividing the square root of *p* in the latter was to fulfill the assumption that Using our efficient algorithm EGRM-A1, we calculated the correlation between (22) and (24) in Model 1, and the correlation between (25) and (27) in Model 2 for *s* = 1,2,…,*p*. Our empirical results were shown in Figure 3, which clearly indicated that the quality of approximation for both models is good for large number of markers.

Note that when *P* ≥ 447, the numbers occurred during the calculation of the correlations for Model 1 exceeded the maximum-representable normalized floating-point value in R. Thus, we can only show the correlations for *P* < 447 for Model 1 in Figure 3. The reason why such large numbers appeared in the calculation is that no scaling factor was applied to the EGRM With the {−1,0,1} coding, the entries in and increased very fast.

We also tried to use the coding as in Model 2. In this case, the correlation for Model 1 can be calculated up to all number of markers and it approached 1 much quicker than the one with the {−1,0,1} coding. Thus, the behavior of the approximation in the two-limit models depends on the coding system, even if two coding systems differ only by a scalar (in our case, ). The reason is also clear: for any fixed *s*, the correlation between the EGRM and the Hadamard product does not change if we just modify the coding system by a scalar, because both matrices will change by a common factor. However, we deal with a linear combination of EGRMs or Hadamard products with different degree *s* in both limit models. Modifying the coding system by a scalar will result in different changes in the EGRMs or Hadamard products with different degree *s*. Therefore, the correlation between the combination of EGRMs and the Hadamard products under the modified coding system will not be the same as before. As a consequence, the contribution of different components to the combined covariance matrix (22) or (25) is determined not only by the weights but also by the scaling factors resulted from coding. We should also note that the predictive ability of a model including epistasis also depends on the coding system (He and Parida 2016; Martini *et al.* 2017, 2019). Further investigations on this direction would be very interesting, but is beyond the scope of this study.

### Outlook for further applications

Many previous studies in genomic prediction, variance component analyses or GWAS took the Hadamard product matrix as the approximation of EGRM and focused only on low-degree epistasis (Muñoz *et al.* 2014; Vitezica *et al.* 2018; Runcie and Crawford 2019). Although when the degree of epistasis is low and the number of markers is large, the quality of approximation is expected to be good for any coding system according to a criterion suggested in Martini *et al.* (2020), we still do not know whether the Hadamard product matrix approaches the genuine EGRM in a strict mathematical sense (Martini *et al.* 2020). Since our algorithms are able to calculate the precise form of the EGRM at very little computational cost, there is no longer need to use the Hadamard product matrix as the approximation.

In some cases, it is necessary to calculate the EGRM for a relatively small number of markers. For example, some studies in genomic prediction suggested exploiting local epistasis, *i.e.*, epistatic effects among markers within a certain region (Akdemir and Jannink 2015; Jiang *et al.* 2018). In GWAS it was proposed to select SNPs contributing to the trait under consideration and only use these SNPs to construct the GRM as a control of the population structure (Listgarten *et al.* 2012; Liu *et al.* 2016). This idea can be generalized to the case of including epistasis, for which one needs to calculate the EGRM with a possibly small subset of markers. Recently, a regional epistasis mapping approach was developed to test the interaction between two chromosomal regions based on the EGRM, which was calculated using markers in the two regions (Santantonio *et al.* 2019). In these cases, it is expected that the quality of approximation by Hadamard product reduced compared with the case of large number of markers (Martini *et al.* 2020). The same holds true when high-order degree of epistasis was considered (Martini *et al.* 2020). Therefore, it is still recommended to apply the proposed algorithms for calculating the exact form of EGRM instead of the Hadamard product approximation.

## Acknowledgments

We would like to thank Yusheng Zhao and Jie Sheng for helpful discussions on the mathematical proofs in the manuscript. We are grateful to the three anonymous reviewers for their valuable suggestions.

## Appendix

### Preliminaries on symmetric polynomials

Here we provide a very brief summary of some definitions and results from the mathematics theory of symmetric polynomials, which is needed to prove the main results of this study. For more details and proofs of theorems, we refer to Macdonald (1998). Throughout this section, *p* is a positive integer and denote *p* distinct variables.

**Definition A.1.**A polynomial*is called symmetric if for any permutation σ of subscripts 1,2,…,**p*, one has

For example, is a symmetric polynomial with two variables; is a symmetric polynomial with three variables. is not a symmetric polynomial.

**Definition A.2.**For any positive integer k, the k-th elementary symmetric polynomial*is the sum of all possible products of k distinct variables. Namely, for 1 ≤**k*≤*p*,

*For k* = 0, we make the convention that * For k > **p*, there is no k distinct variables so

For example, the following is a complete list of nontrivial elementary symmetric polynomials in three variables:

**Definition A.3.***For any positive integer k, the k-th power sum symmetric polynomial*is the sum of the k-th power of all variables, i.e.

For example, ,

The following theorem is called Newton’s Formula as Equation (2.11′) in Macdonald (1998). It establishes a link between the elementary symmetric polynomials and the power sum symmetric polynomials:

**Theorem A.4.***For simplicity, we write*and*The following identity holds for any s such that 1 ≤**s*≤*p*:

Newton’s Formula provides a iterative method to express *e _{s}* (

**x**) in terms of

*p*(

_{i}**x**) with 1 ≤

*i*≤

*s*and

*e*(

_{j}**x**) with 1 ≤

*j*<

*s*. We can also use this formula to explicitly write

*e*(

_{s}**x**) in terms of

*p*(

_{i}**x**) (1 ≤

*i*≤

*s*). The result is the following complicated formula:

**Theorem A.5.**Using the notations of Theorem A.4, we have the following identity:

### A mathematical result generalizing Newton’s formula

In this subsection, we state and prove a mathematical result regarding a special class of polynomials with two sets of variables. The result is important for proving Proposition 3 in the Theory section.

**Definition A.5.**Let*and**be two sets of distinct variables. For simplicity, we write**and**For any positive integers s and t, We define the following two types of polynomials:*

As special cases, we define * ** ** ** We also make the convention that *

The following result can be viewed as a generalization of Newton’s formula (Theorem A.4). Although we did not find it in the mathematics literature, it can be proved using similar methods as in the proof of Newton’s formula (Macdonald 1998). For readers’ convenience, we present a complete proof.

**Theorem A.6.**For any positive integers*s*and*t*, the following identity holds:(28)

*Proof.* In view of the definition of and introducing two independent variables *u* and *v*, we have the following identity:(29)This identity is treated in the ring of formal power series where is the ring of polynomials in with integers as coefficients. Mathematical details on these notions are out of range of this paper, so we refer to Macdonald (1998).

Taking partial derivatives with respect to *u*, we obtain the following:Multiplying *u* with both sides, we have(30)Similarly, we take partial derivatives with respect to *v* on both sides of (29) and multiply with *v*. This gives the following:(31)Now summing (30) and (31) together, we have(32)As formal power series, we haveThus, we have the following:(33)Taking (29), (32), and (33) together, we obtain the following identity:(34)Now, by comparing the coefficients of the term *u ^{s}v^{t}* (for any fixed

*s*and

*t*) on both sides of (34), we obtain (28) as required.

### The proof of Propositions 1 and 2

Recall that is the additive EGRM of degree *s* and **M*** _{A}* is the

*n*×

*p*marker matrix with additive coding, where

*n*is the number of individuals and

*p*the number of markers. The strategy of the proof is as follows: first, we reveal that the entries in correspond to the elementary symmetric polynomials and the entries in the cross product of Hadamard powers of the marker matrix correspond to the power sum of symmetric polynomials. Then, we apply Theorem A.4 and Theorem A.5 to obtain formulas of calculating using

We start by looking at each entry of in details. For 1 ≤ *j* ≤ *n* and 1 ≤ *i* ≤ *p*, we denote the (*j*, *i*) entry of **M*** _{A}* by Recall that is the matrix whose columns are comprised of products of

*s*distinct columns of the matrix

**M**

*. Thus, the rows of are indexed by the individuals and the columns are indexed by*

_{A}*s*-tuple of integers such that In particular, the elements in are presented byNow, we focus on the calculation of the (

*j*,

*k*) element of So we fix

*j*and

*k*for 1 ≤

*j*,

*k*≤

*n*. Using (7), we know that the (

*j*,

*k*) entry of is calculated as follows:(35)For any subscript

*i*with 1 ≤

*i*≤

*p*, we define (since

*j*and

*k*are fixed we drop these subscripts in

*z*for simplicity). Then we can rewrite (35) as the following:(36)where is the elementary symmetric polynomial defined in Definition A.2. This is the key observation that motivated us to apply the mathematical theory of symmetric polynomials to solve the entire problem.

Next, we recall the *n* × *n* matrix [see (8)]. Then the (*j*, *k*) element of is:(37)where is the power sum symmetric polynomial (Definition A.3).

In view of (36) and (37), we have shown that and correspond to the elementary and power sum symmetric polynomials with as variables. Thus applying the Newton’s Formula (Theorem A.4), we obtain the following iterative formulaSince the above formula holds for each pair of subscripts (*j*, *k*) (1 ≤ *j*, *k* ≤ *n*), Proposition 1 is proved.

Similarly, applying the explicit formula obtained in Theorem A.5 to each pair of subscripts (*j*, *k*), we complete the proof of Proposition 2.

### The proof of Proposition 3

As in the last subsection, we start by looking at each entry of in details. Recall that **M*** _{A}*,

**M**

*are the*

_{D}*n*×

*p*marker matrices with additive and dominance coding, where

*n*is the number of individuals and

*p*the number of markers. For 1 ≤

*k*≤

*n*and 1 ≤

*i*≤

*p*, we denote the (

*k*,

*i*) entry of

**M**

*, and*

_{A}**M**

*by and respectively. Recall that is the matrix whose columns consist of products of*

_{D}*s*distinct columns of

**M**

*and*

_{A}*t*distinct columns of

**M**

*(in addition, all the*

_{D}*s*+

*t*columns must be distinct). Thus, the rows of are indexed by the individuals and the columns are indexed by

*s*+

*t*-tuple of integers satisfying condition (*) in the Theory section. We recall the condition (*) in the following:(38)Hence, the elements in are presented byIn the following, we focused on the calculation of the (

*k*,

*l*) element of So we fix

*k*and

*l*for 1 ≤

*k*,

*l*≤

*n*. Using (13), we know that the (

*k*,

*l*) entry of is calculated as follows:(39)where * is the condition specified in (38).

Now, for any subscript *i*, *j* with 1 ≤ *i*, *j* ≤ *p*, we define and (since *k* and *l* are fixed we drop these subscripts in *z* for simplicity). For further simplification we write and Then we can rewrite (39) as follows:(40)where is the polynomial defined in Definition A.5.

As the next step, we recall the *n* × *n* matrix [see (14)]. Then, the (*k*, *l*) element of is:(41)where is the polynomial defined in Definition A.5.

Thus, we have shown in (40) and (41) that and correspond to the polynomials defined in Definition A.5 with **z*** ^{A}*,

**z**

*as variables. Thus, we can apply Theorem A.6 and obtain the following iterative formula:Noting that the above formula holds for each pair of subscripts (*

^{D}*k*,

*l*) (1 ≤

*k*,

*l*≤

*n*), we completed the proof of Proposition 3.

### The proof of the two claims in Proposition 4

For simplicity, we may assume that the minor allele is chosen as the reference allele for each locus. And it is typical in the genetic data analysis to filter the markers with a certain threshold of minor allele frequency (MAF) *p*_{0}. Thus the allele frequency for all *i*.

We first show that the entries in **G*** _{A}* are bounded. We know that the (

*j*,

*k*) entry of

**G**

*iswhere*

_{A}Using the assumption and the fact that we can easily deduce that and Then, we haveThus, we have shown that the entries in **G*** _{A}* are bounded and it is important that the bound is independent of the number of markers. Since the entries in the Hadamard product is just the

*s*-th power of the entries in

**G**

*, they are also bounded. By the inductive hypothesis, we know that for 1 ≤*

_{A}*s*≤

*r*− 1,Therefore, the entries in is also bounded, for any

*s*with 1 ≤

*s*≤

*r*− 1. The proof of Claim 1 is completed.

Next, we take a detailed look at the entries in for *s* ≥ 2. Recall that . Thus,Again using the assumption and the fact that we can deduce that(42)If *s* is even, it is clear that If *s* is odd, we deduce in a similar way that(43)Combining (42) and (43), we obtain that for any *s* ≥ 2. The proof of Claim 2 is completed.

## Footnotes

Supplemental material available at figshare: https://doi.org/10.25386/genetics.12962906.

*Communicating editor: H. Daetwyler*

- Received August 12, 2020.
- Accepted September 21, 2020.

- Copyright © 2020 by the Genetics Society of America