The Genotype-Tissue Expression (GTEx) project (Lonsdale et al. 2013) aims to measure human tissue-specific gene expression levels. With the collected data, we will be able to explore the landscape of gene expression, gene regulation, and their deep connections with genetic variation.
Raw GTEx data contains expression measurements from various types of elements (such as genes, pseudogenes, and noncoding DNA sequences) covering the whole genome. For some analyses, it might be desirable to keep only a subset of the data, for example, data from protein coding genes. In such cases, mapping the original Ensembl gene IDs to Entrez gene IDs or HGNC symbols becomes an essential step in the analysis pipeline.
grex offers a minimal dependency solution for such ID mappings. Currently, an Ensembl ID from GTEx can be mapped to its Entrez gene ID, HGNC gene symbol, and UniProt ID, with basic annotation information such as HGNC gene name, cytogenetic location, and gene type. We also limit our scope to the Ensembl IDs that appear in the gene read count data. Ensembl IDs from transcript data will be considered in future versions.
To facilitate such ID conversion tasks, the grex package has a built-in mapping table derived from the well-known annotation data package org.Hs.eg.db (Carlson 2015). The mapping data we used has integrated mapping information from Ensembl and NCBI, to maximize the possibility of finding a matched Entrez ID. The R script for creating the mapping table is located here.
Not surprisingly, when creating such a table, there were hundreds of cases where a single Ensembl ID could be mapped to multiple Entrez gene IDs. To create a one-to-one mapping, we took a simple approach: for each such Ensembl ID, we kept only the first Entrez ID encountered in the original database and discarded the rest. Therefore, the mapping may not be accurate in every case. If you have doubts about a particular result, please search for the original ID on the Ensembl website to check whether the mapped ID is correct.
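The "keep the first match" rule described above can be sketched in base R. This is only an illustration under stated assumptions: the toy table and its IDs are invented, and the actual script used to build the mapping table may differ in detail.

```r
# Hypothetical one-to-many mapping table (all IDs invented for illustration)
map <- data.frame(
  ensembl_id = c("ENSG_A", "ENSG_A", "ENSG_B", "ENSG_C", "ENSG_C"),
  entrez_id  = c("101", "102", "201", "301", "302"),
  stringsAsFactors = FALSE
)

# duplicated() flags every repeat of an Ensembl ID after its first occurrence,
# so negating it keeps exactly one row (the first) per Ensembl ID
one_to_one <- map[!duplicated(map$ensembl_id), ]
one_to_one
```

Because the choice depends only on row order, the Entrez ID that is kept is not necessarily the best match, which is why checking doubtful cases by hand is recommended.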
As an example, we use the Ensembl IDs from GTEx V7 gene count data and select 100 IDs:
library("grex")
data("gtexv7")
id <- gtexv7[101:200]
df <- grex(id)
tail(df)
#> ensembl_id entrez_id hgnc_symbol hgnc_name
#> 95 ENSG00000266075 <NA> <NA> <NA>
#> 96 ENSG00000272153 <NA> <NA> <NA>
#> 97 ENSG00000116198 9731 CEP104 centrosomal protein 104
#> 98 ENSG00000169598 1677 DFFB DNA fragmentation factor subunit beta
#> 99 ENSG00000198912 339448 C1orf174 chromosome 1 open reading frame 174
#> 100 ENSG00000236423 100133612 LINC01134 long intergenic non-protein coding RNA 1134
#> cyto_loc uniprot_id gene_biotype
#> 95 <NA> <NA> <NA>
#> 96 <NA> <NA> <NA>
#> 97 1p36.32 O60308 protein_coding
#> 98 1p36.32 B4DZS0 protein_coding
#> 99 1p36.32 Q8IYL3 protein_coding
#> 100 1p36.32 <NA> lincRNA
The elements which cannot be mapped accurately will be NA.
Genes with a mapped Entrez ID:
filtered_genes <-
  df[
    !is.na(df$entrez_id),
    c("ensembl_id", "entrez_id", "hgnc_symbol", "gene_biotype")
  ]
head(filtered_genes)
#> ensembl_id entrez_id hgnc_symbol gene_biotype
#> 1 ENSG00000162576 54587 MXRA8 protein_coding
#> 2 ENSG00000175756 54998 AURKAIP1 protein_coding
#> 4 ENSG00000221978 81669 CCNL2 protein_coding
#> 5 ENSG00000224870 148413 MRPL20-AS1 processed_transcript
#> 6 ENSG00000242485 55052 MRPL20 protein_coding
#> 8 ENSG00000235098 441869 ANKRD65 protein_coding
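As a possible follow-up, the mapped genes can be narrowed down further by gene type, for instance to protein coding genes only. The sketch below rebuilds three rows of the output above by hand so that it runs without grex installed; the object names genes and protein_coding are illustrative, and storing entrez_id as character is an assumption.

```r
# Three rows copied from the output above, rebuilt by hand
genes <- data.frame(
  ensembl_id   = c("ENSG00000162576", "ENSG00000224870", "ENSG00000242485"),
  entrez_id    = c("54587", "148413", "55052"),  # character type is an assumption
  hgnc_symbol  = c("MXRA8", "MRPL20-AS1", "MRPL20"),
  gene_biotype = c("protein_coding", "processed_transcript", "protein_coding"),
  stringsAsFactors = FALSE
)

# %in% returns FALSE (not NA) for missing biotypes, so unmapped rows are dropped
protein_coding <- genes[genes$gene_biotype %in% "protein_coding", ]
protein_coding$hgnc_symbol
```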
If you want to start from the raw GENCODE gene IDs provided by GTEx (e.g. ENSG00000227232.4), the function cleanid() can help you remove the .version part in them, to produce Ensembl IDs.
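To make the effect concrete, here is a base R sketch of what stripping the version suffix amounts to. The helper strip_version() is hypothetical and is not necessarily how cleanid() is implemented; the second ID's version number is also made up for illustration.

```r
# First ID is the example from the text; the ".12" on the second is invented
raw_id <- c("ENSG00000227232.4", "ENSG00000116198.12")

# Hypothetical helper: drop a trailing ".<digits>" version suffix, if present
strip_version <- function(x) sub("\\.[0-9]+$", "", x)

strip_version(raw_id)
```

IDs that carry no version suffix pass through unchanged, so the helper is safe to apply to a mixed vector.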
Conventionally, the next step is to remove (or impute) the genes with NA IDs, and then select the genes to keep. Notably, in the complete gene read count data, there are about 100 cases where multiple Ensembl IDs map to a single Entrez ID. Post-processing steps may also be needed for such genes.
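One way to find the many-to-one cases before deciding how to handle them is sketched below in base R. The toy table and its IDs are invented for illustration; with real data, the same two steps would be applied to the data frame returned by grex().

```r
# Toy result table (all IDs invented for illustration)
df_toy <- data.frame(
  ensembl_id = c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"),
  entrez_id  = c("101", "101", NA, "202"),
  stringsAsFactors = FALSE
)

# Step 1: drop the genes without a mapped Entrez ID
mapped <- df_toy[!is.na(df_toy$entrez_id), ]

# Step 2: flag every row whose Entrez ID occurs more than once;
# scanning from both ends marks the first occurrence as well as the repeats
shared <- duplicated(mapped$entrez_id) |
  duplicated(mapped$entrez_id, fromLast = TRUE)

mapped[shared, ]
```

How the flagged rows are then resolved (for example, keeping one Ensembl ID per Entrez ID or aggregating their counts) depends on the analysis and is left open here.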
We thank members of the Stephens lab (Kushal K Dey, Michael Turchin) for their valuable suggestions and helpful discussions on this problem.