Chapter 6 Estimating the Strength of Expression Conservation

6.1 Theory

6.1.1 Stabilizing selection model in transcriptome evolution across species

The Ornstein-Uhlenbeck (OU) model, which claims that expression changes are constrained by the stabilizing selection, is biologically more realistic than a simple Brownian motion (BM) model(Rohlfs, Harrigan, and Nielsen 2014 Lemos et al. (2005)Gu and Su (2007)Bedford and Hartl (2009)Brawand et al. (2011)).

Intuitively speaking, the OU model includes two opposite processes:

  • random mutations push the expression level (\(x\)) of a gene away from the optimum (\(\mu\)), a process may suffer from a fitness reduction;
  • the return process to the expression optimum (\(\mu\)) driven by a positive selection; the strength of this ‘elastic’ return increases proportionally with w, the coefficient of stabilizing selection. Given the initial expression value \({x_0}\), the OU model predicts that \(x\left(t\right)\), the expression level after \(t\) evolutionary time units, follows a normal distribution with the mean \(E\left[x\ \mid\ x_0\right]\) and variance \(V\left[x\ \mid\ x_0\right]\)

\[ \begin{split} E\left[x\ \mid\ x_0\right]\ &=\ \mu\left(1-e^{-\beta t}\right)\ +\ z_0e^{-\beta t} \\ V\left[x\ \mid\ x_0\right]\ &=\frac{1-e^{-2\beta t}}{W} \end{split}\tag{3.1} \]

respectively, where \(\beta\) is the rate of expression evolution and W is the strength of expression conservation; symbolically one may write an OU process by \(OU\left(x\ \mid\ x_0;\ \theta\right)\), where \(\theta\ =\ \left(\mu,\ \beta t,\ W\right)\) is the parameter vector.

6.1.2 Stationary OU model under a species phylogeny

We shall develop a statistical method to estimate the strength (\(W\)) of expression conservation of a gene from high throughput RNA-seq data of multiple species. Figure 6.1 illustrates the evolutionary scenario used in the evolutionary transcriptome analysis.

  • The first component is the evolutionary lineage from the origin of the tissue (node \(Z\)) to the root (node \(O\)) of the species phylogeny, with \(\tau\) evolutionary time units.
  • The second component is the conventional species phylogeny with \(n\) species. That is, given the initial expression value \(z_0\) at node \(Z\), the OU process of \(x_0\) in the lineage from \(Z\) to \(O\) is given by \(OU\left(x_0\mid z_0;\ \theta\right)\), where the parameter vector \(\theta=\left(\mu,\ \beta\tau,\ W\right)\). The joint density of expressions \(\boldsymbol{x}=\left(x_1,...x_n\right)\) conditional of the expression level (\(x_0\)) at root \(O\), denoted by \(P\left(\boldsymbol{x}\mid x_0\right)\), can be derived under the OU model. It follows that the joint expression density of \(\boldsymbol{x}=\left(x_1,...x_n\right)\) conditional of \(z_0\) is given by

\[P\left(\boldsymbol{x}\mid z_0\right)=\int_{\ -\infty}^{\infty}OU\left(x_0\mid z_0;\ \tau\right)P\left(\boldsymbol{x}\mid x_0\right)dx_0\tag{3.2}\] (Hansen and Martins 1996) showed that either \(P\left(\boldsymbol{x}\mid x_0\right)\) or \(P\left(\boldsymbol{x}\mid z_0\right)\) is multivariate normally distributed. Consider \(P\left(\boldsymbol{x}\mid z_0\right)\sim N\left(\boldsymbol{x}; \boldsymbol{\mu}, \boldsymbol{V}\right)\) at first, where \(\boldsymbol{\mu}\) is the mean vector and \(\boldsymbol{V}\) is the variance-covariance matrix. Because current transcriptome data contain little information about the evolution from node \(Z\) to node \(O\), calculations of \(\boldsymbol{\mu}\) and \(\boldsymbol{V}\) are usually difficult because both depend on \(z_0\) and \(\tau\).

We propose a stationary OU model (sOU) that helpful to avoid these problems in practice, which postulates that, at the genome-wide level, the biological function of a tissue-specific transcriptome is conservative during the course of species evolution. Specifically, sOU involves two assumptions.

  • Origin of the tissue (node \(Z\) in Figure 6.1) was so ancient that the evolutionary time between nodes \(Z\) and \(O\) can be approximated by \(\tau \to \infty\). Consequently, the expression mean and variance at root \(O\) approach to \(\mu\) and \(\rho^{2} = 1/W\), respectively.
  • The optimal expression level (\(\mu\)) and the strength of expression conservation (\(W\)) remain virtually constant along the species phylogeny, that is, the expression mean and variances at all internal and external nodes are equal to \(\mu\) and \(1/W\), respectively.

Under the stationary OU model, \(P\left(\boldsymbol{x}\mid z_0\right)\) can be simplified as follows:

  • \(P\left(\boldsymbol{x}\mid z_0\right)\) is independent of \(z_0\);
  • the mean vector \(\boldsymbol{\mu}\) is uniform, i.e., \(\mu_1=...\mu_n\);
  • the variance-covariance matrix is simply given by \(\boldsymbol{V}=\boldsymbol{R}/W\), where \(\boldsymbol{R}\) is the coefficient of correlation matrix. Our intent is to estimate the strength of expression conservation of a gene, characterized by a single parameter \(W\). In this sense, the joint density of \(\boldsymbol{x}\) can be symbolically written by \(P\left(\boldsymbol{x};\boldsymbol{\mu},\boldsymbol{R},W\right)\). Together, we have

\[P\left(\boldsymbol{x}\mid x_0\right)=P\left(\boldsymbol{x}\mid z_0\right)\sim P\left(\boldsymbol{x};\ \boldsymbol{\mu},\ \boldsymbol{R},\ W\right)=N\left(\boldsymbol{x};\ \boldsymbol{\mu},\ \boldsymbol{R}/W\right)\tag{3.3}\]

The evolutionary scenario for comparative transcriptome analysis

Figure 6.1: The evolutionary scenario for comparative transcriptome analysis

6.1.3 Variation of \(W\) among genes

The sOU model assumes that the strength of expression conservation (\(W\)) of a gene remains a constant in species evolution but differs among genes. Substantial evidence has supported this argument(Bedford and Hartl 2009 Brawand et al. (2011)Cui et al. (2007)Park and Lehner (2013)Tirosh et al. (2006)Warnefors and Eyre-Walker (2012)Zou et al. (2011)). Further, we model \(W\) as a random variable that varies among genes according to a gamma distribution, that is,

\[\phi\left(W;\alpha,\overline{W}\right)=\frac{(\alpha/\overline{W})^\alpha}{\Gamma\left(\alpha\right)}W^{\alpha-1}e^{-\alpha W/\overline{W}}\tag{3.4}\] where \(\overline{W}\) is the mean and \(\alpha\) is the shape parameter; a small values of \(\alpha\) means a high degree of \(W\)-variation, and \(\alpha = \infty\) means a constant \(W\) among genes. After rewriting the joint normal density \(P\left(\boldsymbol{x};\ \boldsymbol{\mu},\ \boldsymbol{R},\ W\right)\) by \(P\left(\boldsymbol{x}\mid W;\ \boldsymbol{\mu},\ \boldsymbol{R},\right)\) to indicate \(W\) is a random variable, we have \[P\left(\boldsymbol{x}\mid W;\ \boldsymbol{\mu},\ \boldsymbol{R},\right)=N\left(\boldsymbol{x};\ \boldsymbol{\mu},\ \boldsymbol{R}/W\right)\tag{3.5}\] It follows that the marginal density of \(\boldsymbol{x}\) is given by \[ \begin{split} P\left(\boldsymbol{x};\boldsymbol{\mu},\boldsymbol{R},\alpha,\overline{W}\right)&=\int_0^{\infty}P\left(\boldsymbol{x}|W;\ \boldsymbol{\mu};\boldsymbol{R}\right)\phi\left(W;\ \alpha,\overline{W}\right)dW\\ &=A\left(\frac{\overline{W}}{\alpha}\right)^{n/2}\left(\frac{\Gamma\left(n/2+\alpha\right)}{\Gamma\left(\alpha\right)}\right)\left(\frac{\alpha}{\alpha+Q\left(\boldsymbol{x}\right)\overline{W}}\right)^{n/2+\alpha} \end{split}\tag{3.6} \] where \(Q\left(\boldsymbol{x}\right)=\left(\boldsymbol{x}-\boldsymbol{\mu}\right)^T\boldsymbol{R}^{-1}\left(\boldsymbol{x}-\boldsymbol{\mu}\right)\) is a quadratic function of \(\boldsymbol{x}\), and \(A=\pi^{-\frac{n}{2}}\left|\boldsymbol{R}\right|^{-\frac{1}{2}}\) is a normalization constant.

6.1.4 An empirical Bayesian framework for gene-specific \(W\) esitimation

6.1.4.1 Posterior mean of W as gene-specific predictor

We adopt an empirical Bayesian procedure to predict the strength of expression conservation for single gene. By the Bayes rule, the posterior density of \(W\) conditional of the expression profile (\(\boldsymbol{x}\)) of a gene is given by

\[P\left(W \mid \boldsymbol{x};\boldsymbol{\mu},\boldsymbol{R},\alpha,\overline{W} \right) =\frac{\phi\left(W;\alpha,\overline{W}\right)P\left(\boldsymbol{x}|W;\boldsymbol{\mu},\boldsymbol{R}\right)}{P\left(\boldsymbol{x};\boldsymbol{\mu},\boldsymbol{R},\alpha,\overline{W}\right)}\tag{3.7}\] After some mathematical calculations, one can show that the analytical form of the posterior density of \(W\) is given by \[P\left(W \mid \boldsymbol{x};\boldsymbol{\mu},\boldsymbol{R},\alpha,\overline{W} \right) = A\left(\frac{\overline{W}}{\alpha}\right)^{n/2}\left(\frac{\Gamma\left(n/2+\alpha\right)}{\Gamma\left(\alpha\right)}\right)\left(\frac{\alpha}{\alpha+Q\left(\boldsymbol{x}\right)\overline{W}}\right)^{n/2+\alpha}\tag{3.8}\] Hence, \(P\left(W \mid \boldsymbol{x};\boldsymbol{\mu},\boldsymbol{R},\alpha,\overline{W} \right)\) follows a gamma distribution, with the mean and variance given by \[ \begin{split} E\left[W|\boldsymbol{x}\right]\ &=\left[\frac{\alpha+\frac{n}{2}}{\alpha+Q\left(x\right)\overline{W}}\right]\overline{W}\\ Var\left[W|\boldsymbol{x}\right]&=\left[\frac{\alpha+\frac{n}{2}}{\left(\alpha+Q\left(x\right)\overline{W}\right)^2}\right](\overline{W})^2 \end{split}\tag{3.9} \]

respectively. Of particular, the posterior mean, \(E\left[W|\boldsymbol{x}\right]\) can be used as the predictor for the strength of expression conservation of a gene with obseved expression profile \(\boldsymbol{x}\).

6.1.4.2 Relative strength of expression conservation

It has been realized that the strenth of expression conservation (\(W_k\)) of gene \(k\) highly depends on the normalization method used for RNA-seq raw reads count. Hence, it is difficult to compare between two sets of estimates when they used different normalization methods. To aleviate this problem, it is more convenient to use the ratio \(U_k=W_k/\overline{W}\), the relative strength of expression conservation. Suppose we have \(N\) orthologous genes under study, and the expression profile of the \(k\)-th gene is denoted by \(\boldsymbol{x}_k\), \(k=1,…N\). Let \(W_{k}=E\left[W|\boldsymbol{x}\right]\) be the posterior predictor for the strength of expression conservaton of gene \(k\). Notice that the expectation of the posterior mean prediction (\(E\left[W|\boldsymbol{x}\right]\)) with respect to the the maginal density \(P\left(\boldsymbol{x};\boldsymbol{\mu},\boldsymbol{R},\alpha,\overline{W}\right)\) is equal to the mean of the strength of expression conservation (\(\overline{W}\)), that is,

\[\int E\left[W\mid \boldsymbol{x}\right]P\left(\boldsymbol{x};\ \boldsymbol{\mu},\ \boldsymbol{R},\ \alpha,\ \overline{W}\right)d\boldsymbol{x}=\overline{W}\tag{3.10}\] Eq.(3.10) implies that the averge of \(U_k\), the relative strength of expression conservation over all genes is roughly to be one, that is,
\[\frac{\sum_{k=1}^NU_k}{N}\approx1\tag{3.11}\]

6.2 Statistical Procedure

Suppose that we have RNA-seq datasets of a particular tissue from \(n\) species, and the expression profile of each \(k\)-th gene denoted by \(\boldsymbol{x_k}=\left(x_{1k,\ }...x_{nk}\right)\), \(k=1...,N\). We developed a practically feasible procedure to estimate \(W\) of each gene, which actually deals with the quadratic function of \(\boldsymbol{x_{k}}\), or \(Q\left(\boldsymbol{x_k}\right)\). The procedure is briefly described below.

  1. Calculate gene-\(k\) specific mean (\(\mu_{k}\)) by a simple average over orthologous genes.
  2. Calculate the matrix of correlation coefficients (\(\boldsymbol{R}\)) from comparative RNA-seq data, which is applied to each of gene.
  3. Calculate the quadratic function of each gene \(k\), \(Q\left(x_k\right)\), by \[\hat{Q}\left(x_k\right)=\sum_{_{i=1}}^n\sum_{j=1}^nc_{ij}\left(x_{ik}-\mu_k\right)\left(x_{jk}-\mu_k\right)\tag{3.12}\] where \(c_{ij}\) is the \(ij\)-th element of matrix \(\boldsymbol{C}=\boldsymbol{R}^{-1}\), \(x_{ik}\) or \(x_{jk}\) is the expression value of gene \(k\) in species \(i\) or \(j\), respectively.
  4. Treating \(Q\left(\boldsymbol{x_k}\right)\) as the observation of gene \(k\) and rewriting Eq.(6), symbolically, by \(P\left(Q\left(\boldsymbol{x_k}\right);\alpha,\ \overline{W}\right)\), build up an approximate likelihood function \[\ell\left(X \ \mid \ \alpha,\ \overline{W}\right)\ = \prod_{k=1}^NP\left(\hat{Q}\left(\boldsymbol{x_k}\right);\ \alpha,\overline{W}\right)\] and obtain the maximum likelihood estimates (MLE) of \(\alpha\) and \(\overline{W}\); the standard likelihood ratio test is applied to test the null hypothesis of no \(W\)-variation among genes, i.e., \(\alpha = \infty\).
  5. Calculate \(W_{k}\), the empirical Bayesian estimate of the strength of expression conservation of gene \(k\), from Eq.(3.9) after replacing \(\alpha\) and \(\overline{W}\) by their estimates.

6.3 Case Study: Emprical Bayesian Esitimates of Expression Conservation of genes

We use the expression values of 5635 1:1 orthologous genes in brain of nine mammalian species to estimate the parameters of the selection pressure gamma distribution in brain. Then we estimate the gene-specific selection pressure based on Bayes’ theorem.

TreeExp can be loaded the package in the usual way:

library('TreeExp')

Let us first load the tetrapod expression dataset:

data('tetraExp')

6.3.1 Inversed correlation matrix

And then, based on the constructed taxaExp object, we are going to create an inverse correlation matrix between mammalian species from the taxaExp object:

species.group <- c("Human", "Chimpanzee", "Bonobo", "Gorilla",
"Macaque", "Mouse", "Opossum", "Platypus")
### all mammalian species

inv.corr.mat <- corrMatInv(tetraExp, taxa = species.group, subtaxa = "Brain")
inv.corr.mat
##                  Human_Brain Chimpanzee_Brain Bonobo_Brain Gorilla_Brain
## Human_Brain       15.7867153      -7.05168768   -0.5467655   -5.21892601
## Chimpanzee_Brain  -7.0516877      26.74465629  -13.4479799   -4.73381009
## Bonobo_Brain      -0.5467655     -13.44797986   18.3508826   -1.88760497
## Gorilla_Brain     -5.2189260      -4.73381009   -1.8876050   14.49822033
## Macaque_Brain     -1.9749665      -1.07677087   -1.3431113   -2.65851421
## Mouse_Brain       -0.6505556      -0.21643982   -0.3682007    0.27736989
## Opossum_Brain     -0.1345412      -0.31082672   -0.1962514   -0.02468375
## Platypus_Brain    -0.1113688      -0.01113911   -0.4318870   -0.01712743
##                  Macaque_Brain Mouse_Brain Opossum_Brain Platypus_Brain
## Human_Brain         -1.9749665  -0.6505556   -0.13454125    -0.11136876
## Chimpanzee_Brain    -1.0767709  -0.2164398   -0.31082672    -0.01113911
## Bonobo_Brain        -1.3431113  -0.3682007   -0.19625137    -0.43188695
## Gorilla_Brain       -2.6585142   0.2773699   -0.02468375    -0.01712743
## Macaque_Brain        9.9086664  -2.0792122   -0.50552772    -0.26332694
## Mouse_Brain         -2.0792122   5.1274351   -1.27896910    -0.61687114
## Opossum_Brain       -0.5055277  -1.2789691    3.90792798    -1.19664014
## Platypus_Brain      -0.2633269  -0.6168711   -1.19664014     3.02761767

6.3.2 Estimation of gamma parameters

Then we need to extract the expression values of orthologous genes from the taxaExp object using exptabTE function.

brain.exptable <- exptabTE(tetraExp, taxa = species.group, subtaxa = "Brain" ,logrithm = TRUE)
head(brain.exptable)
##                 Human_Brain Chimpanzee_Brain Bonobo_Brain Gorilla_Brain
## ENSG00000198824   4.6111724         5.029011     5.151778     4.1093606
## ENSG00000118402   5.3652726         5.597233     5.360715     4.9064096
## ENSG00000166167   6.6794801         6.687621     7.174127     6.4891254
## ENSG00000144724   4.8374387         4.356144     5.460087     4.2280490
## ENSG00000183508   0.9855004         1.327687     2.788686     0.9030383
## ENSG00000008086   4.9321557         4.516015     5.825277     4.7213727
##                 Macaque_Brain Mouse_Brain Opossum_Brain Platypus_Brain
## ENSG00000198824      5.861707    6.700994      7.218781       6.536675
## ENSG00000118402      5.909533    7.396091      7.014913       8.321252
## ENSG00000166167      7.304967    8.257011      7.593279       6.903038
## ENSG00000144724      5.852249    6.657211      7.164605       6.239551
## ENSG00000183508      1.395063    3.382667      3.240314       2.989139
## ENSG00000008086      7.000113    6.889352      7.595966       7.036943

With the inverse correlation matrix and expression values of brain tissue in 9 mammals, we are now able to estimate the parameters of the gamma distribution:

gamma.paras <- estParaGamma(exptable =brain.exptable, corrmatinv =inv.corr.mat)
## print the elements of gamma.paras
gamma.paras
## $alpha
## [1] 2.937128
## 
## $W_average
## [1] 0.2610562
## 
## $speNum
## [1] 8
## 
## $geneNum
## [1] 5636

The \(\bar{W}\) is the average of the selection pressure levels in the tissue brain. And the shape parameter \(\alpha\) here can reflect the internal variances of selection pressure. The more close \(\alpha\) is to 2, the more distinctive selection pressures on genes. And if the \(\alpha\) is close to infinite, it means there are no difference among selection pressures on genes.

6.3.3 Bayesian estimation of gene-specific selection pressure

After parameters of the gamma distribution are estimated, we are able to estimate posterior selection pressures as well as their se with given ‘RPKM’ values across species:

brain.Q <- estParaQ(brain.exptable, corrmatinv = inv.corr.mat)
# with prior expression values and inversed correlation matrix
    
brain.post<- estParaWBayesian(brain.Q, gamma.paras)
brain.W <- brain.post$w # posterior selection pressures
brain.CI <- brain.post$ci95 # posterior expression 95% confidence interval

After esitimating the Bayesian selection pressres, \(W\), we can chek a few gene with hightest selection pressure.

names(brain.W) <- rownames(brain.exptable)
head(sort(brain.W, decreasing = TRUE)) #check a few genes with highest seletion pressure
## ENSG00000137270 ENSG00000102243 ENSG00000139515 ENSG00000146378 
##        0.616582        0.616582        0.616582        0.616582 
## ENSG00000151379 ENSG00000111049 
##        0.616582        0.616582

and draw the dentisty plot of \(W\) among genes.

plot(density(brain.W))

References

Bedford, Trevor, and Daniel L. Hartl. 2009. “Optimization of gene expression by natural selection.” Proceedings of the National Academy of Sciences 106:1133–8. https://doi.org/10.1073/pnas.0812009106.

Brawand, David, Magali Soumillon, Anamaria Necsulea, Philippe Julien, Gábor Csárdi, Patrick Harrigan, Manuela Weier, et al. 2011. “The evolution of gene expression levels in mammalian organs.” Nature 478:343. https://doi.org/10.1038/nature10532.

Cui, Qinghua, Zhenbao Yu, Enrico O. Purisima, and Edwin Wang. 2007. “MicroRNA regulation and interspecific variation of gene expression.” Trends in Genetics 23:372–75. https://doi.org/10.1016/j.tig.2007.04.003.

Gu, Xun, and Zhixi Su. 2007. “Tissue-driven hypothesis of genomic evolution and sequence-expression correlations.” Proceedings of the National Academy of Sciences 104:2779–84. https://doi.org/10.1073/pnas.0610797104.

Hansen, Thomas F, and Emilia P Martins. 1996. “Translating Between Microevolutionary Process and Macroevolutionary Patterns: The Correlation Structure of Interspecific Data.” Evolution 50:1404. https://doi.org/10.2307/2410878.

Lemos, Bernardo, Colin D. Meiklejohn, Mario Cáceres, and Daniel L. Hartl. 2005. “Rate of divergence in gene expression profiles of primates, mice and flies: stabilizing selection and variability among functionnal categories.” Evolution 63:126. https://doi.org/10.1554/04-251.

Park, Solip, and Ben Lehner. 2013. “Epigenetic epistatic interactions constrain the evolution of gene expression.” Molecular Systems Biology 9:645. https://doi.org/10.1038/msb.2013.2.

Rohlfs, Rori V., Patrick Harrigan, and Rasmus Nielsen. 2014. “Modeling Gene Expression Evolution with an Extended Ornstein–Uhlenbeck Process Accounting for Within-Species Variation.” Molecular Biology and Evolution 31:201–11. https://doi.org/10.1093/molbev/mst190.

Tirosh, Itay, Adina Weinberger, Miri Carmi, and Naama Barkai. 2006. “A genetic signature of interspecies variations in gene expression.” Nature Genetics 38:ng1819. https://doi.org/10.1038/ng1819.

Warnefors, Maria, and Adam Eyre-Walker. 2012. “A Selection Index for Gene Expression Evolution and Its Application to the Divergence between Humans and Chimpanzees.” PLoS ONE 7:e34935. https://doi.org/10.1371/journal.pone.0034935.

Zou, Yangyun, Wei Huang, Zhenglong Gu, and Xun Gu. 2011. “Predominant Gain of Promoter TATA Box after Gene Duplication Associated with Stress Responses.” Molecular Biology and Evolution 28:2893–2904. https://doi.org/10.1093/molbev/msr116.