Title: | Generating Various Numerical Representation Schemes for Protein Sequences |
---|---|
Description: | Comprehensive toolkit for generating various numerical features of protein sequences described in Xiao et al. (2015) <DOI:10.1093/bioinformatics/btv042>. For full functionality, the software 'ncbi-blast+' is needed, see <https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html> for more information. |
Authors: | Nan Xiao [aut, cre] , Qing-Song Xu [aut], Dong-Sheng Cao [aut], Sebastian Mueller [ctb] (Alva Genomics) |
Maintainer: | Nan Xiao <[email protected]> |
License: | BSD_3_clause + file LICENSE |
Version: | 1.7-4 |
Built: | 2024-11-09 05:24:04 UTC |
Source: | https://github.com/nanxstats/protr |
This dataset includes the 2D autocorrelations descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
data(AA2DACOR)
data(AA2DACOR)
This dataset includes the 3D-MoRSE descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
data(AA3DMoRSE)
data(AA3DMoRSE)
This dataset includes the atom-centred fragments descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
data(AAACF)
data(AAACF)
BLOSUM100 Matrix for the 20 amino acids. The matrix was extracted from the
Biostrings
package of Bioconductor.
data(AABLOSUM100)
data(AABLOSUM100)
BLOSUM45 Matrix for the 20 amino acids. The matrix was extracted from the
Biostrings
package of Bioconductor.
data(AABLOSUM45)
data(AABLOSUM45)
BLOSUM50 Matrix for the 20 amino acids. The matrix was extracted from the
Biostrings
package of Bioconductor.
data(AABLOSUM50)
data(AABLOSUM50)
BLOSUM62 Matrix for the 20 amino acids. The matrix was extracted from the
Biostrings
package of Bioconductor.
data(AABLOSUM62)
data(AABLOSUM62)
BLOSUM80 Matrix for the 20 amino acids. The matrix was extracted from the
Biostrings
package of Bioconductor.
data(AABLOSUM80)
data(AABLOSUM80)
This dataset includes the Burden eigenvalues descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
data(AABurden)
data(AABurden)
This dataset includes the connectivity indices descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
data(AAConn)
data(AAConn)
This dataset includes the constitutional descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
data(AAConst)
data(AAConst)
This dataset includes the CPSA descriptors of the 20 amino acids calculated by Discovery Studio (version 2.5) used for scales extraction in this package.
All amino acid molecules had also been optimized with MOE 2011.10
(semiempirical AM1) before calculating these CPSA descriptors.
The SDF file containing the information of the optimized amino acid
molecules is included in this package. See OptAA3d
for more information.
data(AACPSA)
data(AACPSA)
This dataset includes all the 2D descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
data(AADescAll)
data(AADescAll)
This dataset includes the edge adjacency indices descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
data(AAEdgeAdj)
data(AAEdgeAdj)
This dataset includes the eigenvalue-based indices descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
data(AAEigIdx)
data(AAEigIdx)
This dataset includes the functional group counts descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
data(AAFGC)
data(AAFGC)
This dataset includes the geometrical descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
data(AAGeom)
data(AAGeom)
This dataset includes the GETAWAY descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
data(AAGETAWAY)
data(AAGETAWAY)
The data was extracted from the AAindex1 database ver 9.1 (https://www.genome.jp/ftp/db/community/aaindex/old/ver9.1/aaindex1) as of November, 2012 (data last modified on 2006-08-14).
With this dataset, users can investigate each property's accession number and other details. See https://www.genome.jp/aaindex/ for more information.
data(AAindex)
data(AAindex)
This dataset includes the information indices descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
data(AAInfo)
data(AAInfo)
This dataset includes the meta information of the 20 amino acids used for the 2D and 3D descriptor calculation in this package. Each column represents:
AAName
Amino acid name
Short
One-letter representation
Abbreviation
Three-letter representation
mol
SMILES representation
PUBCHEM_COMPOUND_CID
PubChem CID for the amino acid
PUBCHEM_LINK
PubChem link for the amino acid
data(AAMetaInfo)
data(AAMetaInfo)
This dataset includes the 2D descriptors of the 20 amino acids calculated by MOE 2011.10 used for scales extraction in this package.
data(AAMOE2D)
data(AAMOE2D)
This dataset includes the 3D descriptors of the 20 amino acids
calculated by MOE 2011.10 used for scales extraction in this package.
All amino acid molecules had also been optimized with MOE (semiempirical AM1)
before calculating these 3D descriptors.
The SDF file containing the information of the optimized amino acid molecules
is included in this package. See OptAA3d
for more information.
data(AAMOE3D)
data(AAMOE3D)
This dataset includes the molecular properties descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
data(AAMolProp)
data(AAMolProp)
PAM120 Matrix for the 20 amino acids. The matrix was extracted from the
Biostrings
package of Bioconductor.
data(AAPAM120)
data(AAPAM120)
PAM250 Matrix for the 20 amino acids. The matrix was extracted from the
Biostrings
package of Bioconductor.
data(AAPAM250)
data(AAPAM250)
PAM30 Matrix for the 20 amino acids. The matrix was extracted from the
Biostrings
package of Bioconductor.
data(AAPAM30)
data(AAPAM30)
PAM40 Matrix for the 20 amino acids. The matrix was extracted from the
Biostrings
package of Bioconductor.
data(AAPAM40)
data(AAPAM40)
PAM70 Matrix for the 20 amino acids. The matrix was extracted from the
Biostrings
package of Bioconductor.
data(AAPAM70)
data(AAPAM70)
This dataset includes the Randic molecular profiles descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
data(AARandic)
data(AARandic)
This dataset includes the RDF descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
data(AARDF)
data(AARDF)
This dataset includes the topological descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
data(AATopo)
data(AATopo)
This dataset includes the topological charge indices descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
data(AATopoChg)
data(AATopoChg)
This dataset includes the walk and path counts descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
data(AAWalk)
data(AAWalk)
This dataset includes the WHIM descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
data(AAWHIM)
data(AAWHIM)
This function calculates the auto covariance and auto cross covariance for generating scale-based descriptors of the same length.
acc(mat, lag)
acc(mat, lag)
mat |
A |
lag |
The lag parameter. Must be less than the amino acids. |
A length lag * p^2
named vector, the element names are
constructed by: the scales index (crossed scales index) and lag index.
Please see the references for details about auto cross covariance.
Nan Xiao <https://nanx.me>
Wold, S., Jonsson, J., Sjorstrom, M., Sandberg, M., & Rannar, S. (1993). DNA and peptide sequences and chemical processes multivariately modelled by principal component analysis and partial least-squares projections to latent structures. Analytica chimica acta, 277(2), 239–253.
Sjostrom, M., Rannar, S., & Wieslander, A. (1995). Polypeptide sequence property relationships in Escherichia coli based on auto cross covariances. Chemometrics and intelligent laboratory systems, 29(2), 295–305.
See extractScales
for scales-based descriptors.
For more details, see extractDescScales
and extractProtFP
.
p <- 8 # p is the scales number n <- 200 # n is the amino acid number lag <- 7 # the lag paramter mat <- matrix(rnorm(p * n), nrow = p, ncol = n) acc(mat, lag)
p <- 8 # p is the scales number n <- 200 # n is the amino acid number lag <- 7 # the lag paramter mat <- matrix(rnorm(p * n), nrow = p, ncol = n) acc(mat, lag)
Parallel calculation of protein sequence similarity based on sequence alignment between two sets of protein sequences.
crossSetSim( protlist1, protlist2, type = "local", cores = 2, batches = 1, verbose = FALSE, submat = "BLOSUM62", gap.opening = 10, gap.extension = 4 )
crossSetSim( protlist1, protlist2, type = "local", cores = 2, batches = 1, verbose = FALSE, submat = "BLOSUM62", gap.opening = 10, gap.extension = 4 )
protlist1 |
A length |
protlist2 |
A length |
type |
Type of alignment, default is |
cores |
Integer. The number of CPU cores to use for parallel execution,
default is |
batches |
Integer. How many batches should we split the similarity computations into. This is useful when you have a large number of protein sequences, enough number of CPU cores, but not enough RAM to compute and fit all the similarities into a single batch. Defaults to 1. |
verbose |
Print the computation progress?
Useful when |
submat |
Substitution matrix, default is |
gap.opening |
The cost required to open a gap of any length in the alignment. Defaults to 10. |
gap.extension |
The cost to extend the length of an existing gap by 1. Defaults to 4. |
A n
x m
similarity matrix.
Sebastian Mueller <https://alva-genomics.com>
## Not run: # Be careful when testing this since it involves parallelization # and might produce unpredictable results in some environments library("Biostrings") library("foreach") library("doParallel") s1 <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] s2 <- readFASTA(system.file("protseq/P08218.fasta", package = "protr"))[[1]] s3 <- readFASTA(system.file("protseq/P10323.fasta", package = "protr"))[[1]] s4 <- readFASTA(system.file("protseq/P20160.fasta", package = "protr"))[[1]] s5 <- readFASTA(system.file("protseq/Q9NZP8.fasta", package = "protr"))[[1]] plist1 <- list(s1 = s1, s2 = s2, s4 = s4) plist2 <- list(s3 = s3, s4_again = s4, s5 = s5, s1_again = s1) psimmat <- crossSetSim(plist1, plist2) colnames(psimmat) <- names(plist1) rownames(psimmat) <- names(plist2) print(psimmat) # s1 s2 s4 # s3 0.10236985 0.18858241 0.05819984 # s4_again 0.04921696 0.12124217 1.00000000 # s5 0.03943488 0.06391103 0.05714638 # s1_again 1.00000000 0.11825938 0.04921696 ## End(Not run)
## Not run: # Be careful when testing this since it involves parallelization # and might produce unpredictable results in some environments library("Biostrings") library("foreach") library("doParallel") s1 <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] s2 <- readFASTA(system.file("protseq/P08218.fasta", package = "protr"))[[1]] s3 <- readFASTA(system.file("protseq/P10323.fasta", package = "protr"))[[1]] s4 <- readFASTA(system.file("protseq/P20160.fasta", package = "protr"))[[1]] s5 <- readFASTA(system.file("protseq/Q9NZP8.fasta", package = "protr"))[[1]] plist1 <- list(s1 = s1, s2 = s2, s4 = s4) plist2 <- list(s3 = s3, s4_again = s4, s5 = s5, s1_again = s1) psimmat <- crossSetSim(plist1, plist2) colnames(psimmat) <- names(plist1) rownames(psimmat) <- names(plist2) print(psimmat) # s1 s2 s4 # s3 0.10236985 0.18858241 0.05819984 # s4_again 0.04921696 0.12124217 1.00000000 # s5 0.03943488 0.06391103 0.05714638 # s1_again 1.00000000 0.11825938 0.04921696 ## End(Not run)
Parallel calculation of protein sequence similarity based on sequence alignment between two sets of protein sequences. This version offloads the partial results in each batch to the hard drive and merges the results together in the end, which reduces the memory usage.
crossSetSimDisk( protlist1, protlist2, cores = 2, batches = 1, path = tempdir(), verbose = FALSE, type = "local", submat = "BLOSUM62", gap.opening = 10, gap.extension = 4 )
crossSetSimDisk( protlist1, protlist2, cores = 2, batches = 1, path = tempdir(), verbose = FALSE, type = "local", submat = "BLOSUM62", gap.opening = 10, gap.extension = 4 )
protlist1 |
A length |
protlist2 |
A length |
cores |
Integer. The number of CPU cores to use for parallel execution,
default is |
batches |
Integer. How many batches should we split the pairwise similarity computations into. This is useful when you have a large number of protein sequences, enough number of CPU cores, but not enough RAM to compute and fit all the pairwise similarities into a single batch. Defaults to 1. |
path |
Directory for caching the results in each batch. Defaults to the temporary directory. |
verbose |
Print the computation progress?
Useful when |
type |
Type of alignment, default is |
submat |
Substitution matrix, default is |
gap.opening |
The cost required to open a gap of any length in the alignment. Defaults to 10. |
gap.extension |
The cost to extend the length of an existing gap by 1. Defaults to 4. |
A n
x m
similarity matrix.
Nan Xiao <https://nanx.me>
See crossSetSim
for the in-memory version.
## Not run: # Be careful when testing this since it involves parallelization # and might produce unpredictable results in some environments library("Biostrings") library("foreach") library("doParallel") s1 <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] s2 <- readFASTA(system.file("protseq/P08218.fasta", package = "protr"))[[1]] s3 <- readFASTA(system.file("protseq/P10323.fasta", package = "protr"))[[1]] s4 <- readFASTA(system.file("protseq/P20160.fasta", package = "protr"))[[1]] s5 <- readFASTA(system.file("protseq/Q9NZP8.fasta", package = "protr"))[[1]] set.seed(1010) plist1 <- as.list(c(s1, s2, s3, s4, s5)[sample(1:5, 100, replace = TRUE)]) plist2 <- as.list(c(s1, s2, s3, s4, s5)[sample(1:5, 100, replace = TRUE)]) psimmat <- crossSetSimDisk( plist1, plist2, cores = 2, batches = 10, verbose = TRUE, type = "local", submat = "BLOSUM62" ) ## End(Not run)
## Not run: # Be careful when testing this since it involves parallelization # and might produce unpredictable results in some environments library("Biostrings") library("foreach") library("doParallel") s1 <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] s2 <- readFASTA(system.file("protseq/P08218.fasta", package = "protr"))[[1]] s3 <- readFASTA(system.file("protseq/P10323.fasta", package = "protr"))[[1]] s4 <- readFASTA(system.file("protseq/P20160.fasta", package = "protr"))[[1]] s5 <- readFASTA(system.file("protseq/Q9NZP8.fasta", package = "protr"))[[1]] set.seed(1010) plist1 <- as.list(c(s1, s2, s3, s4, s5)[sample(1:5, 100, replace = TRUE)]) plist2 <- as.list(c(s1, s2, s3, s4, s5)[sample(1:5, 100, replace = TRUE)]) psimmat <- crossSetSimDisk( plist1, plist2, cores = 2, batches = 10, verbose = TRUE, type = "local", submat = "BLOSUM62" ) ## End(Not run)
This function calculates the Amino Acid Composition descriptor (dim: 20).
extractAAC(x)
extractAAC(x)
x |
A character vector, as the input protein sequence. |
A length 20 named vector
Nan Xiao <https://nanx.me>
M. Bhasin, G. P. S. Raghava. Classification of Nuclear Receptors Based on Amino Acid Composition and Dipeptide Composition. Journal of Biological Chemistry, 2004, 279, 23262.
See extractDC
and extractTC
for Dipeptide Composition and Tripeptide Composition descriptors.
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] extractAAC(x)
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] extractAAC(x)
This function calculates the Amphiphilic Pseudo Amino Acid
Composition (APseAAC, or APAAC) descriptor (dim: 20 + (n * lambda)
,
n
is the number of properties selected, default is 80).
extractAPAAC( x, props = c("Hydrophobicity", "Hydrophilicity"), lambda = 30, w = 0.05, customprops = NULL )
extractAPAAC( x, props = c("Hydrophobicity", "Hydrophilicity"), lambda = 30, w = 0.05, customprops = NULL )
x |
A character vector, as the input protein sequence. |
props |
A character vector, specifying the properties used. 2 properties are used by default, as listed below:
|
lambda |
The lambda parameter for the APAAC descriptors, default is 30. |
w |
The weighting factor, default is 0.05. |
customprops |
A |
A length 20 + n * lambda
named vector,
n
is the number of properties selected.
Note the default 20 * 2
prop
values have already been
independently given in the function. Users can also specify
other (up to 544) properties with the Accession Number in
the AAindex
data, with or without the default
three properties, which means users should explicitly specify
the properties to use. For this descriptor type, users need to
intelligently evaluate the underlying details of the descriptors
provided, instead of using this function with their data blindly.
It would be wise to use some negative and positive control comparisons
where relevant to help guide interpretation of the results.
Nan Xiao <https://nanx.me>
Kuo-Chen Chou. Prediction of Protein Cellular Attributes Using Pseudo-Amino Acid Composition. PROTEINS: Structure, Function, and Genetics, 2001, 43: 246-255.
Kuo-Chen Chou. Using Amphiphilic Pseudo Amino Acid Composition to Predict Enzyme Subfamily Classes. Bioinformatics, 2005, 21, 10-19.
JACS, 1962, 84: 4240-4246. (C. Tanford). (The hydrophobicity data)
PNAS, 1981, 78:3824-3828 (T.P.Hopp & K.R.Woods). (The hydrophilicity data)
See extractPAAC
for the
pseudo amino acid composition (PseAAC) descriptor.
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] extractAPAAC(x) myprops <- data.frame( AccNo = c("MyProp1", "MyProp2", "MyProp3"), A = c(0.62, -0.5, 15), R = c(-2.53, 3, 101), N = c(-0.78, 0.2, 58), D = c(-0.9, 3, 59), C = c(0.29, -1, 47), E = c(-0.74, 3, 73), Q = c(-0.85, 0.2, 72), G = c(0.48, 0, 1), H = c(-0.4, -0.5, 82), I = c(1.38, -1.8, 57), L = c(1.06, -1.8, 57), K = c(-1.5, 3, 73), M = c(0.64, -1.3, 75), F = c(1.19, -2.5, 91), P = c(0.12, 0, 42), S = c(-0.18, 0.3, 31), T = c(-0.05, -0.4, 45), W = c(0.81, -3.4, 130), Y = c(0.26, -2.3, 107), V = c(1.08, -1.5, 43) ) # use 2 default properties, 4 properties from the # AAindex database, and 3 cutomized properties extractAPAAC( x, customprops = myprops, props = c( "Hydrophobicity", "Hydrophilicity", "CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "MyProp1", "MyProp2", "MyProp3" ) )
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] extractAPAAC(x) myprops <- data.frame( AccNo = c("MyProp1", "MyProp2", "MyProp3"), A = c(0.62, -0.5, 15), R = c(-2.53, 3, 101), N = c(-0.78, 0.2, 58), D = c(-0.9, 3, 59), C = c(0.29, -1, 47), E = c(-0.74, 3, 73), Q = c(-0.85, 0.2, 72), G = c(0.48, 0, 1), H = c(-0.4, -0.5, 82), I = c(1.38, -1.8, 57), L = c(1.06, -1.8, 57), K = c(-1.5, 3, 73), M = c(0.64, -1.3, 75), F = c(1.19, -2.5, 91), P = c(0.12, 0, 42), S = c(-0.18, 0.3, 31), T = c(-0.05, -0.4, 45), W = c(0.81, -3.4, 130), Y = c(0.26, -2.3, 107), V = c(1.08, -1.5, 43) ) # use 2 default properties, 4 properties from the # AAindex database, and 3 cutomized properties extractAPAAC( x, customprops = myprops, props = c( "Hydrophobicity", "Hydrophilicity", "CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "MyProp1", "MyProp2", "MyProp3" ) )
This function calculates BLOSUM matrix-derived descriptors.
For users' convenience, protr
provides the BLOSUM45, BLOSUM50,
BLOSUM62, BLOSUM80, BLOSUM100, PAM30, PAM40, PAM70, PAM120, and PAM250
matrices for the 20 amino acids to select from.
extractBLOSUM(x, submat = "AABLOSUM62", k, lag, scale = TRUE, silent = TRUE)
extractBLOSUM(x, submat = "AABLOSUM62", k, lag, scale = TRUE, silent = TRUE)
x |
A character vector, as the input protein sequence. |
submat |
Substitution matrix for the 20 amino acids. Should be one of
|
k |
Integer. The number of selected scales (i.e. the first
|
lag |
The lag parameter. Must be less than the amino acids. |
scale |
Logical. Should we auto-scale the substitution matrix
( |
silent |
Logical. Whether we print the relative importance of
each scales (diagnal value of the eigen decomposition result matrix B)
or not. Default is |
A length lag * p^2
named vector, p
is the number
of scales selected.
Nan Xiao <https://nanx.me>
Georgiev, A. G. (2009). Interpretable numerical descriptors of amino acid space. Journal of Computational Biology, 16(5), 703–723.
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] blosum <- extractBLOSUM(x, submat = "AABLOSUM62", k = 5, lag = 7, scale = TRUE, silent = FALSE)
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] blosum <- extractBLOSUM(x, submat = "AABLOSUM62", k = 5, lag = 7, scale = TRUE, silent = FALSE)
This function calculates the Composition descriptor of the CTD descriptors (dim: 21).
extractCTDC(x)
extractCTDC(x)
x |
A character vector, as the input protein sequence. |
A length 21 named vector
For this descriptor type, users need to intelligently evaluate the underlying details of the descriptors provided, instead of using this function with their data blindly. It would be wise to use some negative and positive control comparisons where relevant to help guide interpretation of the results.
Nan Xiao <https://nanx.me>
Inna Dubchak, Ilya Muchink, Stephen R. Holbrook and Sung-Hou Kim. Prediction of protein folding class using global description of amino acid sequence. Proceedings of the National Academy of Sciences. USA, 1995, 92, 8700-8704.
Inna Dubchak, Ilya Muchink, Christopher Mayor, Igor Dralyuk and Sung-Hou Kim. Recognition of a Protein Fold in the Context of the SCOP classification. Proteins: Structure, Function and Genetics, 1999, 35, 401-407.
See extractCTDT
and extractCTDD
for Transition and Distribution of the CTD descriptors.
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] extractCTDC(x)
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] extractCTDC(x)
This function calculates the Composition descriptor of the CTD descriptors, with customized amino acid classification support.
extractCTDCClass(x, aagroup1, aagroup2, aagroup3)
extractCTDCClass(x, aagroup1, aagroup2, aagroup3)
x |
A character vector, as the input protein sequence. |
aagroup1 |
A named list which contains the first group of customized amino acid classification. See example below. |
aagroup2 |
A named list which contains the second group of customized amino acid classification. See example below. |
aagroup3 |
A named list which contains the third group of customized amino acid classification. See example below. |
A length k * 3
named vector, k
is the number of
amino acid properties used.
For this descriptor type, users need to intelligently evaluate the underlying details of the descriptors provided, instead of using this function with their data blindly. It would be wise to use some negative and positive control comparisons where relevant to help guide interpretation of the results.
Nan Xiao <https://nanx.me>
Inna Dubchak, Ilya Muchink, Stephen R. Holbrook and Sung-Hou Kim. Prediction of protein folding class using global description of amino acid sequence. Proceedings of the National Academy of Sciences. USA, 1995, 92, 8700-8704.
Inna Dubchak, Ilya Muchink, Christopher Mayor, Igor Dralyuk and Sung-Hou Kim. Recognition of a Protein Fold in the Context of the SCOP classification. Proteins: Structure, Function and Genetics, 1999, 35, 401-407.
See extractCTDTClass
and
extractCTDDClass
for Transition and Distribution
of the CTD descriptors with customized amino acid classification support.
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] # using five customized amino acid property classification group1 <- list( "hydrophobicity" = c("R", "K", "E", "D", "Q", "N"), "normwaalsvolume" = c("G", "A", "S", "T", "P", "D", "C"), "polarizability" = c("G", "A", "S", "D", "T"), "secondarystruct" = c("E", "A", "L", "M", "Q", "K", "R", "H"), "solventaccess" = c("A", "L", "F", "C", "G", "I", "V", "W") ) group2 <- list( "hydrophobicity" = c("G", "A", "S", "T", "P", "H", "Y"), "normwaalsvolume" = c("N", "V", "E", "Q", "I", "L"), "polarizability" = c("C", "P", "N", "V", "E", "Q", "I", "L"), "secondarystruct" = c("V", "I", "Y", "C", "W", "F", "T"), "solventaccess" = c("R", "K", "Q", "E", "N", "D") ) group3 <- list( "hydrophobicity" = c("C", "L", "V", "I", "M", "F", "W"), "normwaalsvolume" = c("M", "H", "K", "F", "R", "Y", "W"), "polarizability" = c("K", "M", "H", "F", "R", "Y", "W"), "secondarystruct" = c("G", "N", "P", "S", "D"), "solventaccess" = c("M", "S", "P", "T", "H", "Y") ) extractCTDCClass(x, aagroup1 = group1, aagroup2 = group2, aagroup3 = group3)
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] # using five customized amino acid property classification group1 <- list( "hydrophobicity" = c("R", "K", "E", "D", "Q", "N"), "normwaalsvolume" = c("G", "A", "S", "T", "P", "D", "C"), "polarizability" = c("G", "A", "S", "D", "T"), "secondarystruct" = c("E", "A", "L", "M", "Q", "K", "R", "H"), "solventaccess" = c("A", "L", "F", "C", "G", "I", "V", "W") ) group2 <- list( "hydrophobicity" = c("G", "A", "S", "T", "P", "H", "Y"), "normwaalsvolume" = c("N", "V", "E", "Q", "I", "L"), "polarizability" = c("C", "P", "N", "V", "E", "Q", "I", "L"), "secondarystruct" = c("V", "I", "Y", "C", "W", "F", "T"), "solventaccess" = c("R", "K", "Q", "E", "N", "D") ) group3 <- list( "hydrophobicity" = c("C", "L", "V", "I", "M", "F", "W"), "normwaalsvolume" = c("M", "H", "K", "F", "R", "Y", "W"), "polarizability" = c("K", "M", "H", "F", "R", "Y", "W"), "secondarystruct" = c("G", "N", "P", "S", "D"), "solventaccess" = c("M", "S", "P", "T", "H", "Y") ) extractCTDCClass(x, aagroup1 = group1, aagroup2 = group2, aagroup3 = group3)
This function calculates the Distribution descriptor of the CTD descriptors (dim: 105).
extractCTDD(x)
extractCTDD(x)
x |
A character vector, as the input protein sequence. |
A length 105 named vector
For this descriptor type, users need to intelligently evaluate the underlying details of the descriptors provided, instead of using this function with their data blindly. It would be wise to use some negative and positive control comparisons where relevant to help guide interpretation of the results.
Nan Xiao <https://nanx.me>
Inna Dubchak, Ilya Muchink, Stephen R. Holbrook and Sung-Hou Kim. Prediction of protein folding class using global description of amino acid sequence. Proceedings of the National Academy of Sciences. USA, 1995, 92, 8700-8704.
Inna Dubchak, Ilya Muchink, Christopher Mayor, Igor Dralyuk and Sung-Hou Kim. Recognition of a Protein Fold in the Context of the SCOP classification. Proteins: Structure, Function and Genetics, 1999, 35, 401-407.
See extractCTDC
and extractCTDT
for Composition and Transition of the CTD descriptors.
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] extractCTDD(x)
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] extractCTDD(x)
This function calculates the Distribution descriptor of the CTD descriptors, with customized amino acid classification support.
extractCTDDClass(x, aagroup1, aagroup2, aagroup3)
extractCTDDClass(x, aagroup1, aagroup2, aagroup3)
x |
A character vector, as the input protein sequence. |
aagroup1 |
A named list which contains the first group of customized amino acid classification. See example below. |
aagroup2 |
A named list which contains the second group of customized amino acid classification. See example below. |
aagroup3 |
A named list which contains the third group of customized amino acid classification. See example below. |
A length k * 15
named vector, k
is the number of
amino acid properties used.
For this descriptor type, users need to intelligently evaluate the underlying details of the descriptors provided, instead of using this function with their data blindly. It would be wise to use some negative and positive control comparisons where relevant to help guide interpretation of the results.
Nan Xiao <https://nanx.me>
Inna Dubchak, Ilya Muchink, Stephen R. Holbrook and Sung-Hou Kim. Prediction of protein folding class using global description of amino acid sequence. Proceedings of the National Academy of Sciences. USA, 1995, 92, 8700-8704.
Inna Dubchak, Ilya Muchink, Christopher Mayor, Igor Dralyuk and Sung-Hou Kim. Recognition of a Protein Fold in the Context of the SCOP classification. Proteins: Structure, Function and Genetics, 1999, 35, 401-407.
See extractCTDCClass
and
extractCTDTClass
for Composition and Transition of
the CTD descriptors with customized amino acid classification support.
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] # using five customized amino acid property classification group1 <- list( "hydrophobicity" = c("R", "K", "E", "D", "Q", "N"), "normwaalsvolume" = c("G", "A", "S", "T", "P", "D", "C"), "polarizability" = c("G", "A", "S", "D", "T"), "secondarystruct" = c("E", "A", "L", "M", "Q", "K", "R", "H"), "solventaccess" = c("A", "L", "F", "C", "G", "I", "V", "W") ) group2 <- list( "hydrophobicity" = c("G", "A", "S", "T", "P", "H", "Y"), "normwaalsvolume" = c("N", "V", "E", "Q", "I", "L"), "polarizability" = c("C", "P", "N", "V", "E", "Q", "I", "L"), "secondarystruct" = c("V", "I", "Y", "C", "W", "F", "T"), "solventaccess" = c("R", "K", "Q", "E", "N", "D") ) group3 <- list( "hydrophobicity" = c("C", "L", "V", "I", "M", "F", "W"), "normwaalsvolume" = c("M", "H", "K", "F", "R", "Y", "W"), "polarizability" = c("K", "M", "H", "F", "R", "Y", "W"), "secondarystruct" = c("G", "N", "P", "S", "D"), "solventaccess" = c("M", "S", "P", "T", "H", "Y") ) extractCTDDClass(x, aagroup1 = group1, aagroup2 = group2, aagroup3 = group3)
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] # using five customized amino acid property classification group1 <- list( "hydrophobicity" = c("R", "K", "E", "D", "Q", "N"), "normwaalsvolume" = c("G", "A", "S", "T", "P", "D", "C"), "polarizability" = c("G", "A", "S", "D", "T"), "secondarystruct" = c("E", "A", "L", "M", "Q", "K", "R", "H"), "solventaccess" = c("A", "L", "F", "C", "G", "I", "V", "W") ) group2 <- list( "hydrophobicity" = c("G", "A", "S", "T", "P", "H", "Y"), "normwaalsvolume" = c("N", "V", "E", "Q", "I", "L"), "polarizability" = c("C", "P", "N", "V", "E", "Q", "I", "L"), "secondarystruct" = c("V", "I", "Y", "C", "W", "F", "T"), "solventaccess" = c("R", "K", "Q", "E", "N", "D") ) group3 <- list( "hydrophobicity" = c("C", "L", "V", "I", "M", "F", "W"), "normwaalsvolume" = c("M", "H", "K", "F", "R", "Y", "W"), "polarizability" = c("K", "M", "H", "F", "R", "Y", "W"), "secondarystruct" = c("G", "N", "P", "S", "D"), "solventaccess" = c("M", "S", "P", "T", "H", "Y") ) extractCTDDClass(x, aagroup1 = group1, aagroup2 = group2, aagroup3 = group3)
This function calculates the Transition descriptor of the CTD descriptors (dim: 21).
extractCTDT(x)
extractCTDT(x)
x |
A character vector, as the input protein sequence. |
A length 21 named vector
For this descriptor type, users need to intelligently evaluate the underlying details of the descriptors provided, instead of using this function with their data blindly. It would be wise to use some negative and positive control comparisons where relevant to help guide interpretation of the results.
Nan Xiao <https://nanx.me>
Inna Dubchak, Ilya Muchink, Stephen R. Holbrook and Sung-Hou Kim. Prediction of protein folding class using global description of amino acid sequence. Proceedings of the National Academy of Sciences. USA, 1995, 92, 8700-8704.
Inna Dubchak, Ilya Muchink, Christopher Mayor, Igor Dralyuk and Sung-Hou Kim. Recognition of a Protein Fold in the Context of the SCOP classification. Proteins: Structure, Function and Genetics, 1999, 35, 401-407.
See extractCTDC
and extractCTDD
for Composition and Distribution of the CTD descriptors.
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] extractCTDT(x)
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] extractCTDT(x)
This function calculates the Transition descriptor of the CTD descriptors, with customized amino acid classification support.
extractCTDTClass(x, aagroup1, aagroup2, aagroup3)
extractCTDTClass(x, aagroup1, aagroup2, aagroup3)
x |
A character vector, as the input protein sequence. |
aagroup1 |
A named list which contains the first group of customized amino acid classification. See example below. |
aagroup2 |
A named list which contains the second group of customized amino acid classification. See example below. |
aagroup3 |
A named list which contains the third group of customized amino acid classification. See example below. |
A length k * 3
named vector, k
is the number of
amino acid properties used.
For this descriptor type, users need to intelligently evaluate the underlying details of the descriptors provided, instead of using this function with their data blindly. It would be wise to use some negative and positive control comparisons where relevant to help guide interpretation of the results.
Nan Xiao <https://nanx.me>
Inna Dubchak, Ilya Muchink, Stephen R. Holbrook and Sung-Hou Kim. Prediction of protein folding class using global description of amino acid sequence. Proceedings of the National Academy of Sciences. USA, 1995, 92, 8700-8704.
Inna Dubchak, Ilya Muchink, Christopher Mayor, Igor Dralyuk and Sung-Hou Kim. Recognition of a Protein Fold in the Context of the SCOP classification. Proteins: Structure, Function and Genetics, 1999, 35, 401-407.
See extractCTDCClass
and
extractCTDDClass
for Composition and Distribution of
the CTD descriptors with customized amino acid classification support.
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] # using five customized amino acid property classification group1 <- list( "hydrophobicity" = c("R", "K", "E", "D", "Q", "N"), "normwaalsvolume" = c("G", "A", "S", "T", "P", "D", "C"), "polarizability" = c("G", "A", "S", "D", "T"), "secondarystruct" = c("E", "A", "L", "M", "Q", "K", "R", "H"), "solventaccess" = c("A", "L", "F", "C", "G", "I", "V", "W") ) group2 <- list( "hydrophobicity" = c("G", "A", "S", "T", "P", "H", "Y"), "normwaalsvolume" = c("N", "V", "E", "Q", "I", "L"), "polarizability" = c("C", "P", "N", "V", "E", "Q", "I", "L"), "secondarystruct" = c("V", "I", "Y", "C", "W", "F", "T"), "solventaccess" = c("R", "K", "Q", "E", "N", "D") ) group3 <- list( "hydrophobicity" = c("C", "L", "V", "I", "M", "F", "W"), "normwaalsvolume" = c("M", "H", "K", "F", "R", "Y", "W"), "polarizability" = c("K", "M", "H", "F", "R", "Y", "W"), "secondarystruct" = c("G", "N", "P", "S", "D"), "solventaccess" = c("M", "S", "P", "T", "H", "Y") ) extractCTDTClass(x, aagroup1 = group1, aagroup2 = group2, aagroup3 = group3)
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] # using five customized amino acid property classification group1 <- list( "hydrophobicity" = c("R", "K", "E", "D", "Q", "N"), "normwaalsvolume" = c("G", "A", "S", "T", "P", "D", "C"), "polarizability" = c("G", "A", "S", "D", "T"), "secondarystruct" = c("E", "A", "L", "M", "Q", "K", "R", "H"), "solventaccess" = c("A", "L", "F", "C", "G", "I", "V", "W") ) group2 <- list( "hydrophobicity" = c("G", "A", "S", "T", "P", "H", "Y"), "normwaalsvolume" = c("N", "V", "E", "Q", "I", "L"), "polarizability" = c("C", "P", "N", "V", "E", "Q", "I", "L"), "secondarystruct" = c("V", "I", "Y", "C", "W", "F", "T"), "solventaccess" = c("R", "K", "Q", "E", "N", "D") ) group3 <- list( "hydrophobicity" = c("C", "L", "V", "I", "M", "F", "W"), "normwaalsvolume" = c("M", "H", "K", "F", "R", "Y", "W"), "polarizability" = c("K", "M", "H", "F", "R", "Y", "W"), "secondarystruct" = c("G", "N", "P", "S", "D"), "solventaccess" = c("M", "S", "P", "T", "H", "Y") ) extractCTDTClass(x, aagroup1 = group1, aagroup2 = group2, aagroup3 = group3)
This function calculates the Conjoint Triad descriptor (dim: 343).
extractCTriad(x)
extractCTriad(x)
x |
A character vector, as the input protein sequence. |
A length 343 named vector
For this descriptor type, users need to intelligently evaluate the underlying details of the descriptors provided, instead of using this function with their data blindly. It would be wise to use some negative and positive control comparisons where relevant to help guide interpretation of the results.
Nan Xiao <https://nanx.me>
J.W. Shen, J. Zhang, X.M. Luo, W.L. Zhu, K.Q. Yu, K.X. Chen, Y.X. Li, H.L. Jiang. Predicting Protein-protein Interactions Based Only on Sequences Information. Proceedings of the National Academy of Sciences. 007, 104, 4337–4341.
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] extractCTriad(x)
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] extractCTriad(x)
This function calculates the Conjoint Triad descriptor, with customized amino acid classification support.
extractCTriadClass(x, aaclass)
extractCTriadClass(x, aaclass)
x |
A character vector, as the input protein sequence. |
aaclass |
A list containing the customized amino acid classification. See example below. |
A length k^3
named vector, where k
is the number
of customized classes of the amino acids.
For this descriptor type, users need to intelligently evaluate the underlying details of the descriptors provided, instead of using this function with their data blindly. It would be wise to use some negative and positive control comparisons where relevant to help guide interpretation of the results.
Nan Xiao <https://nanx.me>
J.W. Shen, J. Zhang, X.M. Luo, W.L. Zhu, K.Q. Yu, K.X. Chen, Y.X. Li, H.L. Jiang. Predicting Protein-protein Interactions Based Only on Sequences Information. Proceedings of the National Academy of Sciences. 007, 104, 4337–4341.
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] # use customized amino acid classification (normalized van der Waals volume) newclass <- list( c("G", "A", "S", "T", "P", "D", "C"), c("N", "V", "E", "Q", "I", "L"), c("M", "H", "K", "F", "R", "Y", "W") ) extractCTriadClass(x, aaclass = newclass)
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] # use customized amino acid classification (normalized van der Waals volume) newclass <- list( c("G", "A", "S", "T", "P", "D", "C"), c("N", "V", "E", "Q", "I", "L"), c("M", "H", "K", "F", "R", "Y", "W") ) extractCTriadClass(x, aaclass = newclass)
This function calculates the Dipeptide Composition descriptor (dim: 400).
extractDC(x)
extractDC(x)
x |
A character vector, as the input protein sequence. |
A length 400 named vector
Nan Xiao <https://nanx.me>
M. Bhasin, G. P. S. Raghava. Classification of Nuclear Receptors Based on Amino Acid Composition and Dipeptide Composition. Journal of Biological Chemistry, 2004, 279, 23262.
See extractAAC
and extractTC
for Amino Acid Composition and Tripeptide Composition descriptors.
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] extractDC(x)
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] extractDC(x)
This function calculates the scales-based descriptors with molecular descriptors sets calculated by Dragon, Discovery Studio and MOE. Users can specify which molecular descriptors to select from one of these deseriptor sets by specify the numerical or character index of the molecular descriptors in the descriptor set.
extractDescScales( x, propmat, index = NULL, pc, lag, scale = TRUE, silent = TRUE )
extractDescScales( x, propmat, index = NULL, pc, lag, scale = TRUE, silent = TRUE )
x |
A character vector, as the input protein sequence. |
propmat |
The matrix containing the descriptor set for the amino acids,
which can be chosen from
|
index |
Integer vector or character vector. Specify which
molecular descriptors to select from one of these deseriptor
sets by specify the numerical or character index of the molecular
descriptors in the descriptor set.
Default is |
pc |
Integer. The maximum dimension of the space which the data are to be represented in. Must be no greater than the number of amino acid properties provided. |
lag |
The lag parameter. Must be less than the amino acids. |
scale |
Logical. Should we auto-scale the property matrix
( |
silent |
Logical. Whether we print the standard deviation,
proportion of variance and the cumulative proportion of
the selected principal components or not. Default is |
A length lag * p^2
named vector,
p
is the number of scales selected.
Nan Xiao <https://nanx.me>
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] descscales <- extractDescScales( x, propmat = "AATopo", index = c(37:41, 43:47), pc = 5, lag = 7, silent = FALSE )
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] descscales <- extractDescScales( x, propmat = "AATopo", index = c(37:41, 43:47), pc = 5, lag = 7, silent = FALSE )
This function calculates scales-based descriptors derived by Factor Analysis (FA). Users can provide customized amino acid property matrices.
extractFAScales( x, propmat, factors, scores = "regression", lag, scale = TRUE, silent = TRUE )
extractFAScales( x, propmat, factors, scores = "regression", lag, scale = TRUE, silent = TRUE )
x |
A character vector, as the input protein sequence. |
propmat |
A matrix containing the properties for the amino acids. Each row represent one amino acid type, each column represents one property. Note that the one-letter row names must be provided for we need them to seek the properties for each AA type. |
factors |
Integer. The number of factors to be fitted. Must be no greater than the number of AA properties provided. |
scores |
Type of scores to produce. The default is |
lag |
The lag parameter. Must be less than the amino acids number in the protein sequence. |
scale |
Logical. Should we auto-scale the property matrix
( |
silent |
Logical. Whether we print the SS loadings,
proportion of variance and the cumulative proportion of
the selected factors or not. Default is |
A length lag * p^2
named vector,
p
is the number of scales (factors) selected.
Nan Xiao <https://nanx.me>
Atchley, W. R., Zhao, J., Fernandes, A. D., & Druke, T. (2005). Solving the protein sequence metric problem. Proceedings of the National Academy of Sciences of the United States of America, 102(18), 6395-6400.
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] data(AATopo) tprops <- AATopo[, c(37:41, 43:47)] # select a set of topological descriptors fa <- extractFAScales(x, propmat = tprops, factors = 5, lag = 7, silent = FALSE)
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] data(AATopo) tprops <- AATopo[, c(37:41, 43:47)] # select a set of topological descriptors fa <- extractFAScales(x, propmat = tprops, factors = 5, lag = 7, silent = FALSE)
This function calculates the Geary
autocorrelation descriptor (dim: length(props) * nlag
).
extractGeary( x, props = c("CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"), nlag = 30L, customprops = NULL )
extractGeary( x, props = c("CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"), nlag = 30L, customprops = NULL )
x |
A character vector, as the input protein sequence. |
props |
A character vector, specifying the Accession Number of the target properties. 8 properties are used by default, as listed below:
|
nlag |
Maximum value of the lag parameter. Default is |
customprops |
A |
A length length(props) * nlag
named vector.
For this descriptor type, users need to intelligently evaluate the underlying details of the descriptors provided, instead of using this function with their data blindly. It would be wise to use some negative and positive control comparisons where relevant to help guide interpretation of the results.
Nan Xiao <https://nanx.me>
AAindex: Amino acid index database. https://www.genome.jp/dbget/aaindex.html
Feng, Z.P. and Zhang, C.T. (2000) Prediction of membrane protein types based on the hydrophobic index of amino acids. Journal of Protein Chemistry, 19, 269-275.
Horne, D.S. (1988) Prediction of protein helix content from an autocorrelation analysis of sequence hydrophobicities. Biopolymers, 27, 451-477.
Sokal, R.R. and Thomson, B.A. (2006) Population structure inferred by local spatial autocorrelation: an usage from an Amerindian tribal population. American Journal of Physical Anthropology, 129, 121-131.
See extractMoreauBroto
and extractMoran
for Moreau-Broto autocorrelation descriptors and
Moran autocorrelation descriptors.
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] extractGeary(x) myprops <- data.frame( AccNo = c("MyProp1", "MyProp2", "MyProp3"), A = c(0.62, -0.5, 15), R = c(-2.53, 3, 101), N = c(-0.78, 0.2, 58), D = c(-0.9, 3, 59), C = c(0.29, -1, 47), E = c(-0.74, 3, 73), Q = c(-0.85, 0.2, 72), G = c(0.48, 0, 1), H = c(-0.4, -0.5, 82), I = c(1.38, -1.8, 57), L = c(1.06, -1.8, 57), K = c(-1.5, 3, 73), M = c(0.64, -1.3, 75), F = c(1.19, -2.5, 91), P = c(0.12, 0, 42), S = c(-0.18, 0.3, 31), T = c(-0.05, -0.4, 45), W = c(0.81, -3.4, 130), Y = c(0.26, -2.3, 107), V = c(1.08, -1.5, 43) ) # Use 4 properties in the AAindex database, and 3 cutomized properties extractGeary( x, customprops = myprops, props = c( "CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "MyProp1", "MyProp2", "MyProp3" ) )
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] extractGeary(x) myprops <- data.frame( AccNo = c("MyProp1", "MyProp2", "MyProp3"), A = c(0.62, -0.5, 15), R = c(-2.53, 3, 101), N = c(-0.78, 0.2, 58), D = c(-0.9, 3, 59), C = c(0.29, -1, 47), E = c(-0.74, 3, 73), Q = c(-0.85, 0.2, 72), G = c(0.48, 0, 1), H = c(-0.4, -0.5, 82), I = c(1.38, -1.8, 57), L = c(1.06, -1.8, 57), K = c(-1.5, 3, 73), M = c(0.64, -1.3, 75), F = c(1.19, -2.5, 91), P = c(0.12, 0, 42), S = c(-0.18, 0.3, 31), T = c(-0.05, -0.4, 45), W = c(0.81, -3.4, 130), Y = c(0.26, -2.3, 107), V = c(1.08, -1.5, 43) ) # Use 4 properties in the AAindex database, and 3 cutomized properties extractGeary( x, customprops = myprops, props = c( "CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "MyProp1", "MyProp2", "MyProp3" ) )
This function calculates scales-based descriptors derived by Multidimensional Scaling (MDS). Users can provide customized amino acid property matrices.
extractMDSScales(x, propmat, k, lag, scale = TRUE, silent = TRUE)
extractMDSScales(x, propmat, k, lag, scale = TRUE, silent = TRUE)
x |
A character vector, as the input protein sequence. |
propmat |
A matrix containing the properties for the amino acids. Each row represent one amino acid type, each column represents one property. Note that the one-letter row names must be provided for we need them to seek the properties for each AA type. |
k |
Integer. The maximum dimension of the space which the data are to be represented in. Must be no greater than the number of AA properties provided. |
lag |
The lag parameter. Must be less than the amino acids. |
scale |
Logical. Should we auto-scale the property matrix
( |
silent |
Logical. Whether to print the |
A length lag * p^2
named vector,
p
is the number of scales (dimensionality) selected.
Nan Xiao <https://nanx.me>
Venkatarajan, M. S., & Braun, W. (2001). New quantitative descriptors of amino acids based on multidimensional scaling of a large number of physical-chemical properties. Molecular modeling annual, 7(12), 445–453.
See extractScales
for scales-based
descriptors derived by Principal Components Analysis.
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] data(AATopo) tprops <- AATopo[, c(37:41, 43:47)] # select a set of topological descriptors mds <- extractMDSScales(x, propmat = tprops, k = 5, lag = 7, silent = FALSE)
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] data(AATopo) tprops <- AATopo[, c(37:41, 43:47)] # select a set of topological descriptors mds <- extractMDSScales(x, propmat = tprops, k = 5, lag = 7, silent = FALSE)
This function calculates the Moran
autocorrelation descriptor (dim: length(props) * nlag
).
extractMoran( x, props = c("CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"), nlag = 30L, customprops = NULL )
extractMoran( x, props = c("CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"), nlag = 30L, customprops = NULL )
x |
A character vector, as the input protein sequence. |
props |
A character vector, specifying the Accession Number of the target properties. 8 properties are used by default, as listed below:
|
nlag |
Maximum value of the lag parameter. Default is |
customprops |
A |
A length length(props) * nlag
named vector.
For this descriptor type, users need to intelligently evaluate the underlying details of the descriptors provided, instead of using this function with their data blindly. It would be wise to use some negative and positive control comparisons where relevant to help guide interpretation of the results.
Nan Xiao <https://nanx.me>
AAindex: Amino acid index database. https://www.genome.jp/dbget/aaindex.html
Feng, Z.P. and Zhang, C.T. (2000) Prediction of membrane protein types based on the hydrophobic index of amino acids. Journal of Protein Chemistry, 19, 269-275.
Horne, D.S. (1988) Prediction of protein helix content from an autocorrelation analysis of sequence hydrophobicities. Biopolymers, 27, 451-477.
Sokal, R.R. and Thomson, B.A. (2006) Population structure inferred by local spatial autocorrelation: an usage from an Amerindian tribal population. American Journal of Physical Anthropology, 129, 121-131.
See extractMoreauBroto
and extractGeary
for Moreau-Broto autocorrelation descriptors and
Geary autocorrelation descriptors.
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] extractMoran(x) myprops <- data.frame( AccNo = c("MyProp1", "MyProp2", "MyProp3"), A = c(0.62, -0.5, 15), R = c(-2.53, 3, 101), N = c(-0.78, 0.2, 58), D = c(-0.9, 3, 59), C = c(0.29, -1, 47), E = c(-0.74, 3, 73), Q = c(-0.85, 0.2, 72), G = c(0.48, 0, 1), H = c(-0.4, -0.5, 82), I = c(1.38, -1.8, 57), L = c(1.06, -1.8, 57), K = c(-1.5, 3, 73), M = c(0.64, -1.3, 75), F = c(1.19, -2.5, 91), P = c(0.12, 0, 42), S = c(-0.18, 0.3, 31), T = c(-0.05, -0.4, 45), W = c(0.81, -3.4, 130), Y = c(0.26, -2.3, 107), V = c(1.08, -1.5, 43) ) # Use 4 properties in the AAindex database, and 3 cutomized properties extractMoran( x, customprops = myprops, props = c( "CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "MyProp1", "MyProp2", "MyProp3" ) )
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] extractMoran(x) myprops <- data.frame( AccNo = c("MyProp1", "MyProp2", "MyProp3"), A = c(0.62, -0.5, 15), R = c(-2.53, 3, 101), N = c(-0.78, 0.2, 58), D = c(-0.9, 3, 59), C = c(0.29, -1, 47), E = c(-0.74, 3, 73), Q = c(-0.85, 0.2, 72), G = c(0.48, 0, 1), H = c(-0.4, -0.5, 82), I = c(1.38, -1.8, 57), L = c(1.06, -1.8, 57), K = c(-1.5, 3, 73), M = c(0.64, -1.3, 75), F = c(1.19, -2.5, 91), P = c(0.12, 0, 42), S = c(-0.18, 0.3, 31), T = c(-0.05, -0.4, 45), W = c(0.81, -3.4, 130), Y = c(0.26, -2.3, 107), V = c(1.08, -1.5, 43) ) # Use 4 properties in the AAindex database, and 3 cutomized properties extractMoran( x, customprops = myprops, props = c( "CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "MyProp1", "MyProp2", "MyProp3" ) )
This function calculates the normalized Moreau-Broto
autocorrelation descriptor (dim: length(props) * nlag
).
extractMoreauBroto( x, props = c("CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"), nlag = 30L, customprops = NULL )
extractMoreauBroto( x, props = c("CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"), nlag = 30L, customprops = NULL )
x |
A character vector, as the input protein sequence. |
props |
A character vector, specifying the Accession Number of the target properties. 8 properties are used by default, as listed below:
|
nlag |
Maximum value of the lag parameter. Default is |
customprops |
A |
A length length(props) * nlag
named vector.
For this descriptor type, users need to intelligently evaluate the underlying details of the descriptors provided, instead of using this function with their data blindly. It would be wise to use some negative and positive control comparisons where relevant to help guide interpretation of the results.
Nan Xiao <https://nanx.me>
AAindex: Amino acid index database. https://www.genome.jp/dbget/aaindex.html
Feng, Z.P. and Zhang, C.T. (2000) Prediction of membrane protein types based on the hydrophobic index of amino acids. Journal of Protein Chemistry, 19, 269-275.
Horne, D.S. (1988) Prediction of protein helix content from an autocorrelation analysis of sequence hydrophobicities. Biopolymers, 27, 451-477.
Sokal, R.R. and Thomson, B.A. (2006) Population structure inferred by local spatial autocorrelation: an usage from an Amerindian tribal population. American Journal of Physical Anthropology, 129, 121-131.
See extractMoran
and extractGeary
for Moran autocorrelation descriptors and Geary autocorrelation descriptors.
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] extractMoreauBroto(x) myprops <- data.frame( AccNo = c("MyProp1", "MyProp2", "MyProp3"), A = c(0.62, -0.5, 15), R = c(-2.53, 3, 101), N = c(-0.78, 0.2, 58), D = c(-0.9, 3, 59), C = c(0.29, -1, 47), E = c(-0.74, 3, 73), Q = c(-0.85, 0.2, 72), G = c(0.48, 0, 1), H = c(-0.4, -0.5, 82), I = c(1.38, -1.8, 57), L = c(1.06, -1.8, 57), K = c(-1.5, 3, 73), M = c(0.64, -1.3, 75), F = c(1.19, -2.5, 91), P = c(0.12, 0, 42), S = c(-0.18, 0.3, 31), T = c(-0.05, -0.4, 45), W = c(0.81, -3.4, 130), Y = c(0.26, -2.3, 107), V = c(1.08, -1.5, 43) ) # Use 4 properties in the AAindex database, and 3 cutomized properties extractMoreauBroto( x, customprops = myprops, props = c( "CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "MyProp1", "MyProp2", "MyProp3" ) )
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] extractMoreauBroto(x) myprops <- data.frame( AccNo = c("MyProp1", "MyProp2", "MyProp3"), A = c(0.62, -0.5, 15), R = c(-2.53, 3, 101), N = c(-0.78, 0.2, 58), D = c(-0.9, 3, 59), C = c(0.29, -1, 47), E = c(-0.74, 3, 73), Q = c(-0.85, 0.2, 72), G = c(0.48, 0, 1), H = c(-0.4, -0.5, 82), I = c(1.38, -1.8, 57), L = c(1.06, -1.8, 57), K = c(-1.5, 3, 73), M = c(0.64, -1.3, 75), F = c(1.19, -2.5, 91), P = c(0.12, 0, 42), S = c(-0.18, 0.3, 31), T = c(-0.05, -0.4, 45), W = c(0.81, -3.4, 130), Y = c(0.26, -2.3, 107), V = c(1.08, -1.5, 43) ) # Use 4 properties in the AAindex database, and 3 cutomized properties extractMoreauBroto( x, customprops = myprops, props = c( "CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "MyProp1", "MyProp2", "MyProp3" ) )
This function calculates the Pseudo Amino Acid Composition (PseAAC)
descriptor (dim: 20 + lambda
, default is 50).
extractPAAC( x, props = c("Hydrophobicity", "Hydrophilicity", "SideChainMass"), lambda = 30, w = 0.05, customprops = NULL )
extractPAAC( x, props = c("Hydrophobicity", "Hydrophilicity", "SideChainMass"), lambda = 30, w = 0.05, customprops = NULL )
x |
A character vector, as the input protein sequence. |
props |
A character vector, specifying the properties used. 3 properties are used by default, as listed below:
|
lambda |
The lambda parameter for the PseAAC descriptors, default is 30. |
w |
The weighting factor, default is 0.05. |
customprops |
A |
A length 20 + lambda
named vector
Note the default 20 * 3
prop
values have already been
independently given in the function. Users can also specify
other (up to 544) properties with the Accession Number in
the AAindex
data, with or without the default
three properties, which means users should explicitly specify
the properties to use. For this descriptor type, users need to
intelligently evaluate the underlying details of the descriptors
provided, instead of using this function with their data blindly.
It would be wise to use some negative and positive control comparisons
where relevant to help guide interpretation of the results.
Nan Xiao <https://nanx.me>
Kuo-Chen Chou. Prediction of Protein Cellular Attributes Using Pseudo-Amino Acid Composition. PROTEINS: Structure, Function, and Genetics, 2001, 43: 246-255.
Kuo-Chen Chou. Using Amphiphilic Pseudo Amino Acid Composition to Predict Enzyme Subfamily Classes. Bioinformatics, 2005, 21, 10-19.
JACS, 1962, 84: 4240-4246. (C. Tanford). (The hydrophobicity data)
PNAS, 1981, 78:3824-3828 (T.P.Hopp & K.R.Woods). (The hydrophilicity data)
CRC Handbook of Chemistry and Physics, 66th ed., CRC Press, Boca Raton, Florida (1985). (The side-chain mass data)
R.M.C. Dawson, D.C. Elliott, W.H. Elliott, K.M. Jones, Data for Biochemical Research 3rd ed., Clarendon Press Oxford (1986). (The side-chain mass data)
See extractAPAAC
for amphiphilic pseudo
amino acid composition descriptor.
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] extractPAAC(x) myprops <- data.frame( AccNo = c("MyProp1", "MyProp2", "MyProp3"), A = c(0.62, -0.5, 15), R = c(-2.53, 3, 101), N = c(-0.78, 0.2, 58), D = c(-0.9, 3, 59), C = c(0.29, -1, 47), E = c(-0.74, 3, 73), Q = c(-0.85, 0.2, 72), G = c(0.48, 0, 1), H = c(-0.4, -0.5, 82), I = c(1.38, -1.8, 57), L = c(1.06, -1.8, 57), K = c(-1.5, 3, 73), M = c(0.64, -1.3, 75), F = c(1.19, -2.5, 91), P = c(0.12, 0, 42), S = c(-0.18, 0.3, 31), T = c(-0.05, -0.4, 45), W = c(0.81, -3.4, 130), Y = c(0.26, -2.3, 107), V = c(1.08, -1.5, 43) ) # use 3 default properties, 4 properties from the # AAindex database, and 3 cutomized properties extractPAAC( x, customprops = myprops, props = c( "Hydrophobicity", "Hydrophilicity", "SideChainMass", "CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "MyProp1", "MyProp2", "MyProp3" ) )
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] extractPAAC(x) myprops <- data.frame( AccNo = c("MyProp1", "MyProp2", "MyProp3"), A = c(0.62, -0.5, 15), R = c(-2.53, 3, 101), N = c(-0.78, 0.2, 58), D = c(-0.9, 3, 59), C = c(0.29, -1, 47), E = c(-0.74, 3, 73), Q = c(-0.85, 0.2, 72), G = c(0.48, 0, 1), H = c(-0.4, -0.5, 82), I = c(1.38, -1.8, 57), L = c(1.06, -1.8, 57), K = c(-1.5, 3, 73), M = c(0.64, -1.3, 75), F = c(1.19, -2.5, 91), P = c(0.12, 0, 42), S = c(-0.18, 0.3, 31), T = c(-0.05, -0.4, 45), W = c(0.81, -3.4, 130), Y = c(0.26, -2.3, 107), V = c(1.08, -1.5, 43) ) # use 3 default properties, 4 properties from the # AAindex database, and 3 cutomized properties extractPAAC( x, customprops = myprops, props = c( "Hydrophobicity", "Hydrophilicity", "SideChainMass", "CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "MyProp1", "MyProp2", "MyProp3" ) )
This function calculates amino acid properties based scales descriptors (protein fingerprint). Users can specify which AAindex properties to select from the AAindex database by specify the numerical or character index of the properties in the AAindex database.
extractProtFP(x, index = NULL, pc, lag, scale = TRUE, silent = TRUE)
extractProtFP(x, index = NULL, pc, lag, scale = TRUE, silent = TRUE)
x |
A character vector, as the input protein sequence. |
index |
Integer vector or character vector. Specify which AAindex
properties to select from the AAindex database by specify the
numerical or character index of the properties in the AAindex database.
Default is |
pc |
Integer. Use the first pc principal components as the scales. Must be no greater than the number of AA properties provided. |
lag |
The lag parameter. Must be less than the amino acids. |
scale |
Logical. Should we auto-scale the property matrix
before PCA? Default is |
silent |
Logical. Whether we print the standard deviation,
proportion of variance and the cumulative proportion of
the selected principal components or not. Default is |
A length lag * p^2
named vector,
p
is the number of scales (principal components) selected.
Nan Xiao <https://nanx.me>
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] fp <- extractProtFP(x, index = c(160:165, 258:296), pc = 5, lag = 7, silent = FALSE)
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] fp <- extractProtFP(x, index = c(160:165, 258:296), pc = 5, lag = 7, silent = FALSE)
This function calculates amino acid properties based scales descriptors (protein fingerprint) with gap support. Users can specify which AAindex properties to select from the AAindex database by specify the numerical or character index of the properties in the AAindex database.
extractProtFPGap(x, index = NULL, pc, lag, scale = TRUE, silent = TRUE)
extractProtFPGap(x, index = NULL, pc, lag, scale = TRUE, silent = TRUE)
x |
A character vector, as the input protein sequence.
Use ' |
index |
Integer vector or character vector. Specify which AAindex
properties to select from the AAindex database by specify the
numerical or character index of the properties in the AAindex database.
Default is |
pc |
Integer. Use the first pc principal components as the scales. Must be no greater than the number of AA properties provided. |
lag |
The lag parameter. Must be less than the amino acids. |
scale |
Logical. Should we auto-scale the property matrix
before PCA? Default is |
silent |
Logical. Whether we print the standard deviation,
proportion of variance and the cumulative proportion of
the selected principal components or not. Default is |
A length lag * p^2
named vector,
p
is the number of scales (principal components) selected.
Nan Xiao <https://nanx.me>
# amino acid sequence with gaps x <- readFASTA(system.file("protseq/align.fasta", package = "protr"))$`IXI_235` fp <- extractProtFPGap(x, index = c(160:165, 258:296), pc = 5, lag = 7, silent = FALSE)
# amino acid sequence with gaps x <- readFASTA(system.file("protseq/align.fasta", package = "protr"))$`IXI_235` fp <- extractProtFPGap(x, index = c(160:165, 258:296), pc = 5, lag = 7, silent = FALSE)
This function calculates the PSSM (Position-Specific Scoring Matrix) derived by PSI-Blast for given protein sequence or peptides.
extractPSSM( seq, start.pos = 1L, end.pos = nchar(seq), psiblast.path = NULL, makeblastdb.path = NULL, database.path = NULL, iter = 5, silent = TRUE, evalue = 10L, word.size = NULL, gapopen = NULL, gapextend = NULL, matrix = "BLOSUM62", threshold = NULL, seg = "no", soft.masking = FALSE, culling.limit = NULL, best.hit.overhang = NULL, best.hit.score.edge = NULL, xdrop.ungap = NULL, xdrop.gap = NULL, xdrop.gap.final = NULL, window.size = NULL, gap.trigger = 22L, num.threads = 1L, pseudocount = 0L, inclusion.ethresh = 0.002 )
extractPSSM( seq, start.pos = 1L, end.pos = nchar(seq), psiblast.path = NULL, makeblastdb.path = NULL, database.path = NULL, iter = 5, silent = TRUE, evalue = 10L, word.size = NULL, gapopen = NULL, gapextend = NULL, matrix = "BLOSUM62", threshold = NULL, seg = "no", soft.masking = FALSE, culling.limit = NULL, best.hit.overhang = NULL, best.hit.score.edge = NULL, xdrop.ungap = NULL, xdrop.gap = NULL, xdrop.gap.final = NULL, window.size = NULL, gap.trigger = 22L, num.threads = 1L, pseudocount = 0L, inclusion.ethresh = 0.002 )
seq |
Character vector, as the input protein sequence. |
start.pos |
Optional integer denoting the start position of the
fragment window. Default is |
end.pos |
Optional integer denoting the end position of the
fragment window. Default is |
psiblast.path |
Character string indicating the path of the
|
makeblastdb.path |
Character string indicating the path of the
|
database.path |
Character string indicating the path of a reference database (a FASTA file). |
iter |
Number of iterations to perform for PSI-Blast. |
silent |
Logical. Whether the PSI-Blast running output
should be shown or not (May not work on some Windows versions and
PSI-Blast versions), default is |
evalue |
Expectation value (E) threshold for saving hits.
Default is |
word.size |
Word size for wordfinder algorithm. An integer >= 2. |
gapopen |
Integer. Cost to open a gap. |
gapextend |
Integer. Cost to extend a gap. |
matrix |
Character string. The scoring matrix name
(default is |
threshold |
Minimum word score such that the word is added to the BLAST lookup table. A real value >= 0. |
seg |
Character string. Filter query sequence with SEG ( |
soft.masking |
Logical. Apply filtering locations as soft masks?
Default is |
culling.limit |
An integer >= 0. If the query range of a hit is
enveloped by that of at least this many higher-scoring hits,
delete the hit. Incompatible with |
best.hit.overhang |
Best Hit algorithm overhang value
(A real value >= 0 and =< 0.5, recommended value: 0.1).
Incompatible with |
best.hit.score.edge |
Best Hit algorithm score edge value
(A real value >=0 and =< 0.5, recommended value: 0.1).
Incompatible with |
xdrop.ungap |
X-dropoff value (in bits) for ungapped extensions. |
xdrop.gap |
X-dropoff value (in bits) for preliminary gapped extensions. |
xdrop.gap.final |
X-dropoff value (in bits) for final gapped alignment. |
window.size |
An integer >= 0. Multiple hits window size,
To specify 1-hit algorithm, use |
gap.trigger |
Number of bits to trigger gapping. Default is |
num.threads |
Integer. Number of threads (CPUs) to use in the
BLAST search. Default is |
pseudocount |
Integer. Pseudo-count value used when constructing PSSM.
Default is |
inclusion.ethresh |
E-value inclusion threshold for pairwise alignments.
Default is |
For given protein sequences or peptides, PSSM represents the log-likelihood of the substitution of the 20 types of amino acids at that position in the sequence. Note that the output value is not normalized.
The original PSSM, a numeric matrix which has
end.pos - start.pos + 1
columns and 20
named rows.
The function requires the makeblastdb
and psiblast
programs
to be properly installed in the operation system or
their paths provided.
The two command-line programs are included in the NCBI-BLAST+ software package. To install NCBI Blast+ for your operating system, see https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html for detailed instructions.
Ubuntu/Debian users can directly use the command
sudo apt-get install ncbi-blast+
to install NCBI Blast+.
For OS X users, download ncbi-blast- ... .dmg
then install.
For Windows users, download ncbi-blast- ... .exe
then install.
Nan Xiao <https://nanx.me>
Altschul, Stephen F., et al. "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic acids research 25.17 (1997): 3389–3402.
Ye, Xugang, Guoli Wang, and Stephen F. Altschul. "An assessment of substitution scores for protein profile-profile comparison." Bioinformatics 27.24 (2011): 3356–3363.
Rangwala, Huzefa, and George Karypis. "Profile-based direct kernels for remote homology detection and fold recognition." Bioinformatics 21.23 (2005): 4239–4247.
extractPSSMFeature extractPSSMAcc
if (Sys.which("makeblastdb") == "" | Sys.which("psiblast") == "") { cat("Cannot find makeblastdb or psiblast. Please install NCBI Blast+ first") } else { x <- readFASTA(system.file( "protseq/P00750.fasta", package = "protr" ))[[1]] dbpath <- tempfile("tempdb", fileext = ".fasta") invisible(file.copy(from = system.file( "protseq/Plasminogen.fasta", package = "protr" ), to = dbpath)) pssmmat <- extractPSSM(seq = x, database.path = dbpath) dim(pssmmat) # 20 x 562 (P00750: length 562, 20 Amino Acids) }
if (Sys.which("makeblastdb") == "" | Sys.which("psiblast") == "") { cat("Cannot find makeblastdb or psiblast. Please install NCBI Blast+ first") } else { x <- readFASTA(system.file( "protseq/P00750.fasta", package = "protr" ))[[1]] dbpath <- tempfile("tempdb", fileext = ".fasta") invisible(file.copy(from = system.file( "protseq/Plasminogen.fasta", package = "protr" ), to = dbpath)) pssmmat <- extractPSSM(seq = x, database.path = dbpath) dim(pssmmat) # 20 x 562 (P00750: length 562, 20 Amino Acids) }
This function calculates the feature vector based on the PSSM by running PSI-Blast and auto cross covariance tranformation.
extractPSSMAcc(pssmmat, lag)
extractPSSMAcc(pssmmat, lag)
pssmmat |
The PSSM computed by |
lag |
The lag parameter. Must be less than the number of amino acids in the sequence (i.e. the number of columns in the PSSM matrix). |
A length lag * 20^2
named numeric vector,
the element names are derived by the amino acid name abbreviation
(crossed amino acid name abbreviation) and lag index.
Nan Xiao <https://nanx.me>
Wold, S., Jonsson, J., Sjorstrom, M., Sandberg, M., & Rannar, S. (1993). DNA and peptide sequences and chemical processes multivariately modelled by principal component analysis and partial least-squares projections to latent structures. Analytica chimica acta, 277(2), 239–253.
extractPSSM extractPSSMFeature
if (Sys.which("makeblastdb") == "" | Sys.which("psiblast") == "") { cat("Cannot find makeblastdb or psiblast. Please install NCBI Blast+") } else { x <- readFASTA(system.file( "protseq/P00750.fasta", package = "protr" ))[[1]] dbpath <- tempfile("tempdb", fileext = ".fasta") invisible(file.copy(from = system.file( "protseq/Plasminogen.fasta", package = "protr" ), to = dbpath)) pssmmat <- extractPSSM(seq = x, database.path = dbpath) pssmacc <- extractPSSMAcc(pssmmat, lag = 3) tail(pssmacc) }
if (Sys.which("makeblastdb") == "" | Sys.which("psiblast") == "") { cat("Cannot find makeblastdb or psiblast. Please install NCBI Blast+") } else { x <- readFASTA(system.file( "protseq/P00750.fasta", package = "protr" ))[[1]] dbpath <- tempfile("tempdb", fileext = ".fasta") invisible(file.copy(from = system.file( "protseq/Plasminogen.fasta", package = "protr" ), to = dbpath)) pssmmat <- extractPSSM(seq = x, database.path = dbpath) pssmacc <- extractPSSMAcc(pssmmat, lag = 3) tail(pssmacc) }
This function calculates the profile-based protein representation
derived by PSSM. The feature vector is based on the PSSM computed by
extractPSSM
.
extractPSSMFeature(pssmmat)
extractPSSMFeature(pssmmat)
pssmmat |
The PSSM computed by |
For a given sequence, the PSSM feature represents the log-likelihood of the substitution of the 20 types of amino acids at that position in the sequence.
Each PSSM feature value in the vector represents the degree of conservation of a given amino acid type. The value is normalized to interval (0, 1) by the transformation 1/(1+e^(-x)).
A numeric vector which has 20 x N
named elements,
where N
is the size of the window (number of rows of the PSSM).
Nan Xiao <https://nanx.me>
Ye, Xugang, Guoli Wang, and Stephen F. Altschul. "An assessment of substitution scores for protein profile-profile comparison." Bioinformatics 27.24 (2011): 3356–3363.
Rangwala, Huzefa, and George Karypis. "Profile-based direct kernels for remote homology detection and fold recognition." Bioinformatics 21.23 (2005): 4239–4247.
if (Sys.which("makeblastdb") == "" | Sys.which("psiblast") == "") { cat("Cannot find makeblastdb or psiblast. Please install NCBI Blast+") } else { x <- readFASTA(system.file( "protseq/P00750.fasta", package = "protr" ))[[1]] dbpath <- tempfile("tempdb", fileext = ".fasta") invisible(file.copy(from = system.file( "protseq/Plasminogen.fasta", package = "protr" ), to = dbpath)) pssmmat <- extractPSSM(seq = x, database.path = dbpath) pssmfeature <- extractPSSMFeature(pssmmat) head(pssmfeature) }
if (Sys.which("makeblastdb") == "" | Sys.which("psiblast") == "") { cat("Cannot find makeblastdb or psiblast. Please install NCBI Blast+") } else { x <- readFASTA(system.file( "protseq/P00750.fasta", package = "protr" ))[[1]] dbpath <- tempfile("tempdb", fileext = ".fasta") invisible(file.copy(from = system.file( "protseq/Plasminogen.fasta", package = "protr" ), to = dbpath)) pssmmat <- extractPSSM(seq = x, database.path = dbpath) pssmfeature <- extractPSSMFeature(pssmmat) head(pssmfeature) }
This function calculates the Quasi-Sequence-Order (QSO) descriptor
(dim: 20 + 20 + (2 * nlag)
, default is 100).
extractQSO(x, nlag = 30, w = 0.1)
extractQSO(x, nlag = 30, w = 0.1)
x |
A character vector, as the input protein sequence. |
nlag |
The maximum lag, defualt is 30. |
w |
The weighting factor, default is 0.1. |
A length 20 + 20 + (2 * nlag)
named vector
Nan Xiao <https://nanx.me>
Kuo-Chen Chou. Prediction of Protein Subcellar Locations by Incorporating Quasi-Sequence-Order Effect. Biochemical and Biophysical Research Communications, 2000, 278, 477-483.
Kuo-Chen Chou and Yu-Dong Cai. Prediction of Protein Sucellular Locations by GO-FunD-PseAA Predictor. Biochemical and Biophysical Research Communications, 2004, 320, 1236-1239.
Gisbert Schneider and Paul Wrede. The Rational Design of Amino Acid Sequences by Artifical Neural Networks and Simulated Molecular Evolution: Do Novo Design of an Idealized Leader Cleavge Site. Biophys Journal, 1994, 66, 335-344.
See extractSOCN
for sequence-order-coupling numbers.
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] extractQSO(x)
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] extractQSO(x)
This function calculates scales-based descriptors derived by Principal Components Analysis (PCA). Users can provide customized amino acid property matrices. This function implements the core computation procedure needed for the scales-based descriptors derived by AA-Properties (AAindex) and scales-based descriptors derived by 20+ classes of 2D and 3D molecular descriptors (Topological, WHIM, VHSE, etc.) in the protr package.
extractScales(x, propmat, pc, lag, scale = TRUE, silent = TRUE)
extractScales(x, propmat, pc, lag, scale = TRUE, silent = TRUE)
x |
A character vector, as the input protein sequence. |
propmat |
A matrix containing the properties for the amino acids. Each row represent one amino acid type, each column represents one property. Note that the one-letter row names must be provided for we need them to seek the properties for each AA type. |
pc |
Integer. Use the first pc principal components as the scales. Must be no greater than the number of AA properties provided. |
lag |
The lag parameter. Must be less than the amino acids. |
scale |
Logical. Should we auto-scale the property matrix
( |
silent |
Logical. Whether we print the standard deviation,
proportion of variance and the cumulative proportion of
the selected principal components or not. Default is |
A length lag * p^2
named vector,
p
is the number of scales (principal components) selected.
Nan Xiao <https://nanx.me>
See extractDescScales
scales descriptors based on
20+ classes of molecular descriptors, and extractProtFP
for amino acid property based scales descriptors (protein fingerprint).
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] data(AAindex) AAidxmat <- t(na.omit(as.matrix(AAindex[, 7:26]))) scales <- extractScales(x, propmat = AAidxmat, pc = 5, lag = 7, silent = FALSE)
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] data(AAindex) AAidxmat <- t(na.omit(as.matrix(AAindex[, 7:26]))) scales <- extractScales(x, propmat = AAidxmat, pc = 5, lag = 7, silent = FALSE)
This function calculates scales-based descriptors derived by Principal Components Analysis (PCA), with gap support. Users can provide customized amino acid property matrices. This function implements the core computation procedure needed for the scales-based descriptors derived by AA-Properties (AAindex) and scales-based descriptors derived by 20+ classes of 2D and 3D molecular descriptors (Topological, WHIM, VHSE, etc.) in the protr package.
extractScalesGap(x, propmat, pc, lag, scale = TRUE, silent = TRUE)
extractScalesGap(x, propmat, pc, lag, scale = TRUE, silent = TRUE)
x |
A character vector, as the input protein sequence.
Use ' |
propmat |
A matrix containing the properties for the amino acids. Each row represent one amino acid type, each column represents one property. Note that the one-letter row names must be provided for we need them to seek the properties for each AA type. |
pc |
Integer. Use the first pc principal components as the scales. Must be no greater than the number of AA properties provided. |
lag |
The lag parameter. Must be less than the amino acids. |
scale |
Logical. Should we auto-scale the property matrix
( |
silent |
Logical. Whether to print the standard deviation,
proportion of variance and the cumulative proportion of
the selected principal components or not. Default is |
A length lag * p^2
named vector,
p
is the number of scales (principal components) selected.
Nan Xiao <https://nanx.me>
See extractProtFPGap
for amino acid property based
scales descriptors (protein fingerprint) with gap support.
# amino acid sequence with gaps x <- readFASTA(system.file("protseq/align.fasta", package = "protr"))$`IXI_235` data(AAindex) AAidxmat <- t(na.omit(as.matrix(AAindex[, 7:26]))) scales <- extractScalesGap(x, propmat = AAidxmat, pc = 5, lag = 7, silent = FALSE)
# amino acid sequence with gaps x <- readFASTA(system.file("protseq/align.fasta", package = "protr"))$`IXI_235` data(AAindex) AAidxmat <- t(na.omit(as.matrix(AAindex[, 7:26]))) scales <- extractScalesGap(x, propmat = AAidxmat, pc = 5, lag = 7, silent = FALSE)
This function calculates the Sequence-Order-Coupling Numbers
(dim: nlag * 2
, default is 60).
extractSOCN(x, nlag = 30)
extractSOCN(x, nlag = 30)
x |
A character vector, as the input protein sequence. |
nlag |
The maximum lag, defualt is 30. |
A length nlag * 2
named vector
Nan Xiao <https://nanx.me>
Kuo-Chen Chou. Prediction of Protein Subcellar Locations by Incorporating Quasi-Sequence-Order Effect. Biochemical and Biophysical Research Communications, 2000, 278, 477-483.
Kuo-Chen Chou and Yu-Dong Cai. Prediction of Protein Sucellular Locations by GO-FunD-PseAA Predictor. Biochemical and Biophysical Research Communications, 2004, 320, 1236-1239.
Gisbert Schneider and Paul Wrede. The Rational Design of Amino Acid Sequences by Artifical Neural Networks and Simulated Molecular Evolution: Do Novo Design of an Idealized Leader Cleavge Site. Biophys Journal, 1994, 66, 335-344.
See extractQSO
for quasi-sequence-order descriptors.
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] extractSOCN(x)
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] extractSOCN(x)
This function calculates the Tripeptide Composition descriptor (dim: 8,000).
extractTC(x)
extractTC(x)
x |
A character vector, as the input protein sequence. |
A length 8,000 named vector
Nan Xiao <https://nanx.me>
M. Bhasin, G. P. S. Raghava. Classification of Nuclear Receptors Based on Amino Acid Composition and Dipeptide Composition. Journal of Biological Chemistry, 2004, 279, 23262.
See extractAAC
and extractDC
for Amino Acid Composition and Dipeptide Composition descriptors.
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] extractTC(x)
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] extractTC(x)
This function retrieves protein sequences from uniprot.org by protein ID(s).
getUniProt(id)
getUniProt(id)
id |
A character vector, as the protein ID(s). |
A list, each component contains one protein sequence.
Nan Xiao <https://nanx.me>
See readFASTA
for reading FASTA format files.
## Not run: # Network latency may slow down this example # Only test this when your connection is fast enough ids <- c("P00750", "P00751", "P00752") getUniProt(ids) ## End(Not run)
## Not run: # Network latency may slow down this example # Only test this when your connection is fast enough ids <- c("P00750", "P00751", "P00752") getUniProt(ids) ## End(Not run)
OptAA3d.sdf - 20 Amino Acids Optimized with MOE 2011.10 (Semiempirical AM1)
# this operation requires the rcdk package # require(rcdk) # optaa3d = load.molecules(system.file("sysdata/OptAA3d.sdf", package = "protr")) # view.molecule.2d(optaa3d[[1]]) # view the first AA
# this operation requires the rcdk package # require(rcdk) # optaa3d = load.molecules(system.file("sysdata/OptAA3d.sdf", package = "protr")) # view.molecule.2d(optaa3d[[1]]) # view the first AA
This function calculates protein similarity based on Gene Ontology (GO) similarity.
parGOSim( golist, type = c("go", "gene"), ont = c("MF", "BP", "CC"), organism = "human", measure = "Resnik", combine = "BMA" )
parGOSim( golist, type = c("go", "gene"), ont = c("MF", "BP", "CC"), organism = "human", measure = "Resnik", combine = "BMA" )
golist |
A list, each component contains a character vector of GO terms or one Entrez Gene ID. |
type |
Input type for |
ont |
Default is |
organism |
Organism name. Default is |
measure |
Default is |
combine |
Default is |
A n
x n
similarity matrix.
Nan Xiao <https://nanx.me>
See twoGOSim
for calculating the
GO semantic similarity between two groups of GO terms or two Entrez gene IDs.
See parSeqSim
for paralleled protein similarity
calculation based on Smith-Waterman local alignment.
## Not run: # Be careful when testing this since it involves GO similarity computation # and might produce unpredictable results in some environments library("GOSemSim") library("org.Hs.eg.db") # By GO Terms # AP4B1 go1 <- c( "GO:0005215", "GO:0005488", "GO:0005515", "GO:0005625", "GO:0005802", "GO:0005905" ) # BCAS2 go2 <- c( "GO:0005515", "GO:0005634", "GO:0005681", "GO:0008380", "GO:0031202" ) # PDE4DIP go3 <- c( "GO:0003735", "GO:0005622", "GO:0005840", "GO:0006412" ) golist <- list(go1, go2, go3) parGOSim(golist, type = "go", ont = "CC", measure = "Wang") # By Entrez gene id genelist <- list(c("150", "151", "152", "1814", "1815", "1816")) parGOSim(genelist, type = "gene", ont = "BP", measure = "Wang") ## End(Not run)
## Not run: # Be careful when testing this since it involves GO similarity computation # and might produce unpredictable results in some environments library("GOSemSim") library("org.Hs.eg.db") # By GO Terms # AP4B1 go1 <- c( "GO:0005215", "GO:0005488", "GO:0005515", "GO:0005625", "GO:0005802", "GO:0005905" ) # BCAS2 go2 <- c( "GO:0005515", "GO:0005634", "GO:0005681", "GO:0008380", "GO:0031202" ) # PDE4DIP go3 <- c( "GO:0003735", "GO:0005622", "GO:0005840", "GO:0006412" ) golist <- list(go1, go2, go3) parGOSim(golist, type = "go", ont = "CC", measure = "Wang") # By Entrez gene id genelist <- list(c("150", "151", "152", "1814", "1815", "1816")) parGOSim(genelist, type = "gene", ont = "BP", measure = "Wang") ## End(Not run)
Parallel calculation of protein sequence similarity based on sequence alignment.
parSeqSim( protlist, cores = 2, batches = 1, verbose = FALSE, type = "local", submat = "BLOSUM62", gap.opening = 10, gap.extension = 4 )
parSeqSim( protlist, cores = 2, batches = 1, verbose = FALSE, type = "local", submat = "BLOSUM62", gap.opening = 10, gap.extension = 4 )
protlist |
A length |
cores |
Integer. The number of CPU cores to use for parallel execution,
default is |
batches |
Integer. How many batches should we split the pairwise similarity computations into. This is useful when you have a large number of protein sequences, enough number of CPU cores, but not enough RAM to compute and fit all the pairwise similarities into a single batch. Defaults to 1. |
verbose |
Print the computation progress?
Useful when |
type |
Type of alignment, default is |
submat |
Substitution matrix, default is |
gap.opening |
The cost required to open a gap of any length in the alignment. Defaults to 10. |
gap.extension |
The cost to extend the length of an existing gap by 1. Defaults to 4. |
A n
x n
similarity matrix.
Nan Xiao <https://nanx.me>
See parSeqSimDisk
for the disk-based version.
## Not run: # Be careful when testing this since it involves parallelization # and might produce unpredictable results in some environments library("Biostrings") library("foreach") library("doParallel") s1 <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] s2 <- readFASTA(system.file("protseq/P08218.fasta", package = "protr"))[[1]] s3 <- readFASTA(system.file("protseq/P10323.fasta", package = "protr"))[[1]] s4 <- readFASTA(system.file("protseq/P20160.fasta", package = "protr"))[[1]] s5 <- readFASTA(system.file("protseq/Q9NZP8.fasta", package = "protr"))[[1]] plist <- list(s1, s2, s3, s4, s5) (psimmat <- parSeqSim(plist, cores = 2, type = "local", submat = "BLOSUM62")) ## End(Not run)
## Not run: # Be careful when testing this since it involves parallelization # and might produce unpredictable results in some environments library("Biostrings") library("foreach") library("doParallel") s1 <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] s2 <- readFASTA(system.file("protseq/P08218.fasta", package = "protr"))[[1]] s3 <- readFASTA(system.file("protseq/P10323.fasta", package = "protr"))[[1]] s4 <- readFASTA(system.file("protseq/P20160.fasta", package = "protr"))[[1]] s5 <- readFASTA(system.file("protseq/Q9NZP8.fasta", package = "protr"))[[1]] plist <- list(s1, s2, s3, s4, s5) (psimmat <- parSeqSim(plist, cores = 2, type = "local", submat = "BLOSUM62")) ## End(Not run)
Parallel calculation of protein sequence similarity based on sequence alignment. This version offloads the partial results in each batch to the hard drive and merges the results together in the end, which reduces the memory usage.
parSeqSimDisk( protlist, cores = 2, batches = 1, path = tempdir(), verbose = FALSE, type = "local", submat = "BLOSUM62", gap.opening = 10, gap.extension = 4 )
parSeqSimDisk( protlist, cores = 2, batches = 1, path = tempdir(), verbose = FALSE, type = "local", submat = "BLOSUM62", gap.opening = 10, gap.extension = 4 )
protlist |
A length |
cores |
Integer. The number of CPU cores to use for parallel execution,
default is |
batches |
Integer. How many batches should we split the pairwise similarity computations into. This is useful when you have a large number of protein sequences, enough number of CPU cores, but not enough RAM to compute and fit all the pairwise similarities into a single batch. Defaults to 1. |
path |
Directory for caching the results in each batch. Defaults to the temporary directory. |
verbose |
Print the computation progress?
Useful when |
type |
Type of alignment, default is |
submat |
Substitution matrix, default is |
gap.opening |
The cost required to open a gap of any length in the alignment. Defaults to 10. |
gap.extension |
The cost to extend the length of an existing gap by 1. Defaults to 4. |
A n
x n
similarity matrix.
Nan Xiao <https://nanx.me>
See parSeqSim
for the in-memory version.
## Not run: # Be careful when testing this since it involves parallelization # and might produce unpredictable results in some environments library("Biostrings") library("foreach") library("doParallel") s1 <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] s2 <- readFASTA(system.file("protseq/P08218.fasta", package = "protr"))[[1]] s3 <- readFASTA(system.file("protseq/P10323.fasta", package = "protr"))[[1]] s4 <- readFASTA(system.file("protseq/P20160.fasta", package = "protr"))[[1]] s5 <- readFASTA(system.file("protseq/Q9NZP8.fasta", package = "protr"))[[1]] set.seed(1010) plist <- as.list(c(s1, s2, s3, s4, s5)[sample(1:5, 100, replace = TRUE)]) psimmat <- parSeqSimDisk( plist, cores = 2, batches = 10, verbose = TRUE, type = "local", submat = "BLOSUM62" ) ## End(Not run)
## Not run: # Be careful when testing this since it involves parallelization # and might produce unpredictable results in some environments library("Biostrings") library("foreach") library("doParallel") s1 <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] s2 <- readFASTA(system.file("protseq/P08218.fasta", package = "protr"))[[1]] s3 <- readFASTA(system.file("protseq/P10323.fasta", package = "protr"))[[1]] s4 <- readFASTA(system.file("protseq/P20160.fasta", package = "protr"))[[1]] s5 <- readFASTA(system.file("protseq/Q9NZP8.fasta", package = "protr"))[[1]] set.seed(1010) plist <- as.list(c(s1, s2, s3, s4, s5)[sample(1:5, 100, replace = TRUE)]) psimmat <- parSeqSimDisk( plist, cores = 2, batches = 10, verbose = TRUE, type = "local", submat = "BLOSUM62" ) ## End(Not run)
This function checks if the protein sequence's amino acid types are in the 20 default types.
protcheck(x)
protcheck(x)
x |
A character vector, as the input protein sequence. |
Logical. TRUE
if all of the amino acid types
of the sequence are within the 20 default types.
Nan Xiao <https://nanx.me>
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] protcheck(x) # TRUE protcheck(paste(x, "Z", sep = "")) # FALSE
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] protcheck(x) # TRUE protcheck(paste(x, "Z", sep = "")) # FALSE
This function extracts the segmentations from the protein sequence.
protseg( x, aa = c("A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"), k = 7 )
protseg( x, aa = c("A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"), k = 7 )
x |
A character vector, as the input protein sequence. |
aa |
A character, the amino acid type. One of
|
k |
A positive integer, specifys the window size (half of the window), default is 7. |
A named list, each component contains one of the segmentations (a character string), names of the list components are the positions of the specified amino acid in the sequence.
Nan Xiao <https://nanx.me>
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] protseg(x, aa = "R", k = 5)
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] protseg(x, aa = "R", k = 5)
This function reads protein sequences in FASTA format.
readFASTA( file = system.file("protseq/P00750.fasta", package = "protr"), legacy.mode = TRUE, seqonly = FALSE )
readFASTA( file = system.file("protseq/P00750.fasta", package = "protr"), legacy.mode = TRUE, seqonly = FALSE )
file |
Path to the file containing the protein sequences
in FASTA format. If it does not contain an absolute or
relative path, the file name is relative to the current
working directory, |
legacy.mode |
If set to |
seqonly |
If set to |
Character vector of the protein sequences.
The three returned argument are just different forms of the same output. If one is interested in a Mahalanobis metric over the original data space, the first argument is all she/he needs. If a transformation into another space (where one can use the Euclidean metric) is preferred, the second returned argument is sufficient. Using A and B is equivalent in the following sense.
Nan Xiao <https://nanx.me>
Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the United States of America, 85: 2444–2448.
See getUniProt
for retrieving
protein sequences from uniprot.org.
P00750 <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))
P00750 <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))
This function reads protein sequences in PDB (Protein Data Bank) format, and return the amino acid sequences represented by single-letter code.
readPDB(file = system.file("protseq/4HHB.pdb", package = "protr"))
readPDB(file = system.file("protseq/4HHB.pdb", package = "protr"))
file |
Path to the file containing the protein sequences in PDB format.
If it does not contain an absolute or
relative path, the file name is relative to the current
working directory, |
Character vector of the protein sequence.
Nan Xiao <https://nanx.me>
Protein Data Bank Contents Guide: Atomic Coordinate Entry Format Description, Version 3.30. Accessed 2013-06-26. https://files.wwpdb.org/pub/pdb/doc/format_descriptions/Format_v33_Letter.pdf
See readFASTA
for reading
protein sequences in FASTA format.
Seq4HHB <- readPDB(system.file("protseq/4HHB.pdb", package = "protr"))
Seq4HHB <- readPDB(system.file("protseq/4HHB.pdb", package = "protr"))
Remove/replace gaps or any irregular characters from protein sequences, to make them suitable for feature extraction or sequence alignment based similarity computation.
removeGaps(x, pattern = "-", replacement = "", ...)
removeGaps(x, pattern = "-", replacement = "", ...)
x |
character vector, containing the input protein sequence(s). |
pattern |
character string contains the gap (or other irregular)
character to be removed or replaced. Default is |
replacement |
a replacement for matched characters.
Default is |
... |
addtional parameters for |
a vector of protein sequence(s) with gaps or irregular characters removed/replaced.
Nan Xiao <https://nanx.me>
# amino acid sequences that contain gaps ("-") aaseq <- list( "MHGDTPTLHEYMLDLQPETTDLYCYEQLSDSSE-EEDEIDGPAGQAEPDRAHYNIVTFCCKCDSTLRLCVQS", "MHGDTPTLHEYMLDLQPETTDLYCYEQLNDSSE-EEDEIDGPAGQAEPDRAHYNIVTFCCKCDSTLRLCVQS" ) ## Not run: #' # gaps create issues for alignment parSeqSim(aaseq) # remove the gaps nogapseq <- removeGaps(aaseq) parSeqSim(nogapseq) ## End(Not run)
# amino acid sequences that contain gaps ("-") aaseq <- list( "MHGDTPTLHEYMLDLQPETTDLYCYEQLSDSSE-EEDEIDGPAGQAEPDRAHYNIVTFCCKCDSTLRLCVQS", "MHGDTPTLHEYMLDLQPETTDLYCYEQLNDSSE-EEDEIDGPAGQAEPDRAHYNIVTFCCKCDSTLRLCVQS" ) ## Not run: #' # gaps create issues for alignment parSeqSim(aaseq) # remove the gaps nogapseq <- removeGaps(aaseq) parSeqSim(nogapseq) ## End(Not run)
This function calculates the Gene Ontology (GO) similarity between two groups of GO terms or two Entrez gene IDs.
twoGOSim( id1, id2, type = c("go", "gene"), ont = c("MF", "BP", "CC"), organism = "human", measure = "Resnik", combine = "BMA" )
twoGOSim( id1, id2, type = c("go", "gene"), ont = c("MF", "BP", "CC"), organism = "human", measure = "Resnik", combine = "BMA" )
id1 |
Character vector. When length > 1: each element is a GO term; when length = 1: the Entrez Gene ID. |
id2 |
Character vector. When length > 1: each element is a GO term; when length = 1: the Entrez Gene ID. |
type |
Input type of id1 and id2, |
ont |
Default is |
organism |
Organism name. Default is |
measure |
Default is |
combine |
Default is |
Similarity value.
Nan Xiao <https://nanx.me>
See parGOSim
for protein similarity calculation
based on Gene Ontology (GO) semantic similarity.
See parSeqSim
for paralleled protein similarity
calculation based on Smith-Waterman local alignment.
## Not run: # Be careful when testing this since it involves GO similarity computation # and might produce unpredictable results in some environments library("GOSemSim") library("org.Hs.eg.db") # By GO terms go1 <- c("GO:0004022", "GO:0004024", "GO:0004023") go2 <- c("GO:0009055", "GO:0020037") twoGOSim(go1, go2, type = "go", ont = "MF", measure = "Wang") # By Entrez gene id gene1 <- "1956" # EGFR gene2 <- "2261" # FGFR3 twoGOSim(gene1, gene2, type = "gene", ont = "BP", measure = "Lin") ## End(Not run)
## Not run: # Be careful when testing this since it involves GO similarity computation # and might produce unpredictable results in some environments library("GOSemSim") library("org.Hs.eg.db") # By GO terms go1 <- c("GO:0004022", "GO:0004024", "GO:0004023") go2 <- c("GO:0009055", "GO:0020037") twoGOSim(go1, go2, type = "go", ont = "MF", measure = "Wang") # By Entrez gene id gene1 <- "1956" # EGFR gene2 <- "2261" # FGFR3 twoGOSim(gene1, gene2, type = "gene", ont = "BP", measure = "Lin") ## End(Not run)
Sequence alignment between two protein sequences.
twoSeqSim( seq1, seq2, type = "local", submat = "BLOSUM62", gap.opening = 10, gap.extension = 4 )
twoSeqSim( seq1, seq2, type = "local", submat = "BLOSUM62", gap.opening = 10, gap.extension = 4 )
seq1 |
Character string, containing one protein sequence. |
seq2 |
Character string, containing another protein sequence. |
type |
Type of alignment, default is |
submat |
Substitution matrix, default is |
gap.opening |
The cost required to open a gap of any length in the alignment. Defaults to 10. |
gap.extension |
The cost to extend the length of an existing gap by 1. Defaults to 4. |
A Biostrings
object containing the alignment scores
and other alignment information.
Nan Xiao <https://nanx.me>
See parSeqSim
for paralleled pairwise
protein similarity calculation based on sequence alignment.
See twoGOSim
for calculating the GO semantic
similarity between two groups of GO terms or two Entrez gene IDs.
## Not run: # Be careful when testing this since it involves sequence alignment # and might produce unpredictable results in some environments library("Biostrings") s1 <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] s2 <- readFASTA(system.file("protseq/P10323.fasta", package = "protr"))[[1]] seqalign <- twoSeqSim(s1, s2) summary(seqalign) score(seqalign) ## End(Not run)
## Not run: # Be careful when testing this since it involves sequence alignment # and might produce unpredictable results in some environments library("Biostrings") s1 <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]] s2 <- readFASTA(system.file("protseq/P10323.fasta", package = "protr"))[[1]] seqalign <- twoSeqSim(s1, s2) summary(seqalign) score(seqalign) ## End(Not run)