Chapter 3 Data Manipulation and Storage

3.1 Input Format

To simplify the use of the package, the updated TreeExp has abandoned the former input format (which required both a gene information file and read count data) and takes only normalized RNA-seq data as its input file. In other words, users should make sure that the input data are already processed and comparable between samples. The package is not likely to provide data filtering or normalization functions.

The expression data are expected to be in the following format:

The expression file should be a text file in matrix form, in which values are separated by tabs. Rows correspond to orthologous gene names, and columns correspond to sample names. Sample names follow the format “TaxaName_SubtaxaName_ReplicatesName”. Usually, TaxaName is the name of the species; SubtaxaName corresponds to a certain tissue, cell type or developmental stage; ReplicatesName gives the name of each replicate for a Taxa_Subtaxa pair. The three labels, TaxaName, SubtaxaName and ReplicatesName, are joined by the ‘_’ character.
Raw read count data should first be normalized, e.g., by RPKM. While RPKM is simple and straightforward, it tends to be unstable when the number of genes expressed across samples differs considerably. This problem can be alleviated by the TPM measure, which has been widely used. Some statistically sophisticated normalization methods, such as TMM and median ratio normalization, have become the built-in standard in many bioinformatics tools for RNA-seq analysis (Robinson, McCarthy, and Smyth 2010).
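The difference between RPKM and TPM can be illustrated with a minimal base R sketch. The count matrix, sample names and gene lengths below are hypothetical toy values, and the helper functions rpkm() and tpm() are written here for illustration only; they are not part of TreeExp.

```r
# Hypothetical toy data: raw read counts for 3 genes in 2 samples.
counts <- matrix(c(100, 200, 700,
                   50, 450, 500),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("Human_CB_Hs1", "Chimpanzee_CB_Pt1")))
# Hypothetical gene lengths in base pairs.
gene_length <- c(geneA = 1000, geneB = 2000, geneC = 5000)

# RPKM: scale by library size (per million reads), then by gene length (per kb).
rpkm <- function(counts, len) {
  per_million <- colSums(counts) / 1e6
  sweep(counts, 2, per_million, "/") / (len / 1e3)
}

# TPM: scale by gene length first, then rescale each sample to sum to one million.
tpm <- function(counts, len) {
  rate <- counts / (len / 1e3)
  sweep(rate, 2, colSums(rate), "/") * 1e6
}

rpkm_values <- rpkm(counts, gene_length)
tpm_values <- tpm(counts, gene_length)

colSums(tpm_values)   # every sample sums to one million
colSums(rpkm_values)  # sample totals generally differ
```

Because every TPM column sums to the same total, values are easier to compare between samples than RPKM values, whose column totals depend on the composition of each sample.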

3.2 Example of Input Data

The example files are included in the TreeExp package and can be found in the /inst/extdata folder of the package.

Here, we select expression values of 100 orthologous genes in eight brain regions (CB, HIP, STR, ACC, V1C, PMC, DPFC, VPFC)¹ among human, chimpanzee, gorilla and gibbon (Xu et al. 2018). Each brain region has 2 to 6 biological replicates per species, except in gibbon, which has only one replicate for every brain region. Note that the expression data here only demonstrate how functions in the package store, manipulate and print the input data; they should not be used for further phylogenetic analysis because too few genes are included in the file.

The table below shows part of the input data.

Gene	Human_DPFC_Hs3	Human_STR_Hs8	Chimpanzee_ACC_REIKO	Gorilla_CB_GON
ENSG00000000003	10.5	1.8	3.7	0.9
ENSG00000000005	0.0	0.0	0.0	0.0
ENSG00000000419	33.7	34.1	17.3	19.9
ENSG00000000457	1.3	1.9	1.2	3.7
ENSG00000000460	0.5	0.7	0.6	1.3
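As a rough check of the format described above, the following base R sketch reads a few rows of the table from an in-memory string and splits the sample names into their three labels. The variable names (example_text, expr, labels) are illustrative assumptions; this is not how TreeExp itself parses the file.

```r
# A few rows copied from the table above, kept in a string so the example is self-contained.
example_text <- paste(
  "Gene\tHuman_DPFC_Hs3\tHuman_STR_Hs8\tChimpanzee_ACC_REIKO\tGorilla_CB_GON",
  "ENSG00000000003\t10.5\t1.8\t3.7\t0.9",
  "ENSG00000000005\t0.0\t0.0\t0.0\t0.0",
  "ENSG00000000419\t33.7\t34.1\t17.3\t19.9",
  sep = "\n")

# Read the tab-delimited matrix; the first column holds the gene names.
expr <- read.table(text = example_text, header = TRUE, sep = "\t",
                   row.names = 1, check.names = FALSE)

# Every sample name must have exactly three '_'-separated parts.
name_parts <- strsplit(colnames(expr), "_", fixed = TRUE)
stopifnot(all(lengths(name_parts) == 3))

# Expression values must be numeric.
stopifnot(all(vapply(expr, is.numeric, logical(1))))

# One row per sample: taxon, subtaxon and replicate labels.
labels <- do.call(rbind, name_parts)
colnames(labels) <- c("taxon", "subtaxon", "replicate")
labels
```

Running such a check before construction can catch sample names that contain extra or missing ‘_’ characters.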

3.3 Construction

The construction function TEconstruct loads the expression level file and wraps it in a list of taxonExp objects (one taxaExp object).

taxa.objects = TEconstruct(ExpValueFP = system.file('extdata/primate_brain_expvalues.txt',
package = 'TreeExp'), taxa = "all", subtaxa = 'all')

The construction process takes several minutes on a desktop computer, depending on data size and hardware performance. To use only part of your data, specify the “taxa” and “subtaxa” options in the function; the construction process will then be faster.

taxa.objects = TEconstruct(ExpValueFP = system.file('extdata/primate_brain_expvalues.txt',
package = 'TreeExp'), taxa = "all", subtaxa = c("ACC","CB"))
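Conceptually, restricting the subtaxa amounts to keeping only the matching sample columns and grouping them by taxon and subtaxon. The base R sketch below illustrates that idea on hypothetical sample names; it is an assumption about the general logic, not TEconstruct's actual implementation.

```r
# Hypothetical sample names following the TaxaName_SubtaxaName_ReplicatesName pattern.
sample_names <- c("Human_ACC_Hs1", "Human_ACC_Hs6", "Human_CB_Hs1",
                  "Human_STR_Hs8", "Gorilla_CB_GON", "Gorilla_CB_Sakura")

# Extract the taxon and subtaxon label of every sample.
parts <- strsplit(sample_names, "_", fixed = TRUE)
taxon <- vapply(parts, `[`, character(1), 1)
subtaxon <- vapply(parts, `[`, character(1), 2)

# Keep only the requested subtaxa, analogous to subtaxa = c("ACC", "CB").
keep <- subtaxon %in% c("ACC", "CB")

# Group the remaining samples by taxon-subtaxon pair;
# each group holds the replicates of one pair.
groups <- split(sample_names[keep],
                paste(taxon[keep], subtaxon[keep], sep = "_"))
groups
```

Each group corresponds to one taxon-subtaxon pair, which matches the granularity of the taxonExp objects described in this section.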

You can take a look at the loaded objects:

print(taxa.objects, details = TRUE)

## 
##  8 taxonExp objects 
## 
## object 1 : Human      ACC 
## object 2 : Human      CB 
## object 3 : Chimpanzee     ACC 
## object 4 : Chimpanzee     CB 
## object 5 : Gorilla    ACC 
## object 6 : Gorilla    CB 
## object 7 : Gibbon     ACC 
## object 8 : Gibbon     CB

You can also print a single taxonExp object:

print(taxa.objects[[1]], printlen = 6)

## 
## One taxonExp object
## Taxon name:  Human 
## Subtaxon name:  ACC 
## Total gene number:  200 
## Total bio replicates number:  5 
## Bio replicates titles:
## [1] "Hs1" "Hs6" "Hs8" "Hs2" "Hs5"

or print a single element of the taxonExp object (here the exp_value element, which holds the expression values, as an example):

taxa.objects[[6]]$exp_value[1:5,]

	Gorilla_CB_Sakura	Gorilla_CB_GON
ENSG00000000003	2.6	0.9
ENSG00000000005	0.0	0.0
ENSG00000000419	13.5	19.9
ENSG00000000457	4.1	3.7
ENSG00000000460	1.3	1.3

Once the construction has completed successfully, the subsequent transcriptome phylogenetic analyses are ready to go.

References

Robinson, Mark D., Davis J. McCarthy, and Gordon K. Smyth. 2010. “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics 26:139–40. https://doi.org/10.1093/bioinformatics/btp616.

Xu, Chuan, Qian Li, Olga Efimova, Liu He, Shoji Tatsumoto, Vita Stepanova, Takao Oishi, et al. 2018. “Human-specific features of spatial gene expression and regulation in eight brain regions.” Genome Research 28:1097–1110. https://doi.org/10.1101/gr.231357.117.

¹ cerebellum (CB), hippocampus (HIP), striatum (STR), anterior cingulate cortex (ACC), primary visual cortex (V1C), premotor cortex (PMC), dorsolateral prefrontal cortex (DPFC), ventrolateral prefrontal cortex (VPFC).