Title: | Ordered Homogeneity Pursuit Lasso for Group Variable Selection |
---|---|
Description: | Ordered homogeneity pursuit lasso (OHPL) algorithm for group variable selection proposed in Lin et al. (2017) <DOI:10.1016/j.chemolab.2017.07.004>. The OHPL method exploits the homogeneity structure in high-dimensional data and enjoys the grouping effect to select groups of important variables automatically. This feature makes it particularly useful for high-dimensional datasets with strongly correlated variables, such as spectroscopic data. |
Authors: | You-Wu Lin [aut], Nan Xiao [aut, cre] |
Maintainer: | Nan Xiao <[email protected]> |
License: | GPL-3 | file LICENSE |
Version: | 1.4.1 |
Built: | 2024-10-31 18:37:26 UTC |
Source: | https://github.com/nanxstats/ohpl |
The beer dataset contains 60 samples published by Norgaard et al. Recorded with a 30mm quartz cell on the undiluted degassed beer and measured from 1100 to 2250 nm (576 data points) in steps of 2 nm.
data(beer)
data(beer)
Norgaard, L., Saudland, A., Wagner, J., Nielsen, J. P., Munck, L., & Engelsen, S. B. (2000). Interval partial least-squares regression (iPLS): a comparative chemometric study with an example from near-infrared spectroscopy. Applied Spectroscopy, 54(3), 413–419.
data("beer") x.cal <- beer$xtrain dim(x.cal) x.test <- beer$xtest dim(x.test) y.cal <- beer$ytrain dim(y.cal) y.test <- beer$ytest dim(y.test) X <- rbind(x.cal, x.test) y <- rbind(y.cal, y.test) n <- nrow(y) set.seed(1001) samp.idx <- sample(1L:n, round(n * 0.7)) X.cal <- X[samp.idx, ] y.cal <- y[samp.idx] X.test <- X[-samp.idx, ] y.test <- y[-samp.idx]
data("beer") x.cal <- beer$xtrain dim(x.cal) x.test <- beer$xtest dim(x.test) y.cal <- beer$ytrain dim(y.cal) y.test <- beer$ytest dim(y.test) X <- rbind(x.cal, x.test) y <- rbind(y.cal, y.test) n <- nrow(y) set.seed(1001) samp.idx <- sample(1L:n, round(n * 0.7)) X.cal <- X[samp.idx, ] y.cal <- y[samp.idx] X.test <- X[-samp.idx, ] y.test <- y[-samp.idx]
Use cross-validation to help select the optimal number
of variable groups and the value of gamma
.
cv.OHPL( X.cal, y.cal, maxcomp, gamma = seq(0.1, 0.9, 0.1), X.test, y.test, cv.folds = 5L, G = 30L, type = c("max", "median"), scale = TRUE, pls.method = "simpls" )
cv.OHPL( X.cal, y.cal, maxcomp, gamma = seq(0.1, 0.9, 0.1), X.test, y.test, cv.folds = 5L, G = 30L, type = c("max", "median"), scale = TRUE, pls.method = "simpls" )
X.cal |
Predictor matrix (training). |
y.cal |
Response matrix with one column (training). |
maxcomp |
Maximum number of components for PLS. |
gamma |
A vector of the gamma sequence between (0, 1). |
X.test |
X.test Predictor matrix (test). |
y.test |
y.test Response matrix with one column (test). |
cv.folds |
Number of cross-validation folds. |
G |
Maximum number of variable groups. |
type |
Find the maximum absolute correlation ( |
scale |
Should the predictor matrix be scaled?
Default is |
pls.method |
Method for fitting the PLS model.
Default is |
A list containing the optimal model, RMSEP, Q2, and other evaluation metrics. Also the optimal number of groups to use in group lasso.
data("wheat") X <- wheat$x y <- wheat$protein n <- nrow(wheat$x) set.seed(1001) samp.idx <- sample(1L:n, round(n * 0.7)) X.cal <- X[samp.idx, ] y.cal <- y[samp.idx] X.test <- X[-samp.idx, ] y.test <- y[-samp.idx] # This could run for a while ## Not run: cv.fit <- cv.OHPL( x, y, maxcomp = 6, gamma = seq(0.1, 0.9, 0.1), x.test, y.test, cv.folds = 5, G = 30, type = "max" ) # the optimal G and gamma cv.fit$opt.G cv.fit$opt.gamma ## End(Not run)
data("wheat") X <- wheat$x y <- wheat$protein n <- nrow(wheat$x) set.seed(1001) samp.idx <- sample(1L:n, round(n * 0.7)) X.cal <- X[samp.idx, ] y.cal <- y[samp.idx] X.test <- X[-samp.idx, ] y.test <- y[-samp.idx] # This could run for a while ## Not run: cv.fit <- cv.OHPL( x, y, maxcomp = 6, gamma = seq(0.1, 0.9, 0.1), x.test, y.test, cv.folds = 5, G = 30, type = "max" ) # the optimal G and gamma cv.fit$opt.G cv.fit$opt.gamma ## End(Not run)
Compute D, L, and C in the Fisher optimal partitions algorithm
dlc(X, maxk)
dlc(X, maxk)
X |
A set of samples. |
maxk |
Maximum number of |
The D, L, and C statistics in the Fisher optimal partitions algorithm.
The Fisher optimal partition algorithm.
FOP(X, k, C)
FOP(X, k, C)
X |
A set of samples. |
k |
Number of classes. |
C |
Statistic from the output of |
Index vector for each sample's classification.
W. D. Fisher (1958). On grouping for maximum homogeneity. Journal of the American Statistical Association, vol. 53, pp. 789–798.
X <- matrix(c( 9.3, 1.8, 1.9, 1.7, 1.5, 1.3, 1.4, 2.0, 1.9, 2.3, 2.1 )) C <- dlc(X, maxk = 8)$C F <- FOP(X, 8, C)
X <- matrix(c( 9.3, 1.8, 1.9, 1.7, 1.5, 1.3, 1.4, 2.0, 1.9, 2.3, 2.1 )) C <- dlc(X, maxk = 8)$C F <- FOP(X, 8, C)
Fits the ordered homogeneity pursuit lasso (OHPL) model.
OHPL( x, y, maxcomp, gamma, cv.folds = 5L, G = 30L, type = c("max", "median"), scale = TRUE, pls.method = "simpls" )
OHPL( x, y, maxcomp, gamma, cv.folds = 5L, G = 30L, type = c("max", "median"), scale = TRUE, pls.method = "simpls" )
x |
Predictor matrix. |
y |
Response matrix with one column. |
maxcomp |
Maximum number of components for PLS. |
gamma |
A number between (0, 1) for generating the gamma sequence.
An usual choice for gamma could be |
cv.folds |
Number of cross-validation folds. |
G |
Maximum number of variable groups. |
type |
Find the maximum absolute correlation ( |
scale |
Should the predictor matrix be scaled? Default is |
pls.method |
Method for fitting the PLS model. Default is |
A list of fitted OHPL model object with performance metrics.
You-Wu Lin, Nan Xiao, Li-Li Wang, Chuan-Quan Li, and Qing-Song Xu (2017). Ordered homogeneity pursuit lasso for group variable selection with applications to spectroscopic data. Chemometrics and Intelligent Laboratory Systems 168, 62–71.
# Generate simulation data dat <- OHPL.sim( n = 100, p = 100, rho = 0.8, coef = rep(1, 10), snr = 3, p.train = 0.5, seed = 1010 ) # Split training and test set x <- dat$x.tr y <- dat$y.tr x.test <- dat$x.te y.test <- dat$y.te # Fit the OHPL model fit <- OHPL(x, y, maxcomp = 3, gamma = 0.5, G = 10, type = "max") # Selected variables fit$Vsel # Make predictions y.pred <- predict(fit, x.test) # Compute evaluation metric RMSEP, Q2 and MAE for the test set perf <- OHPL.RMSEP(fit, x.test, y.test) perf$RMSEP perf$Q2 perf$MAE
# Generate simulation data dat <- OHPL.sim( n = 100, p = 100, rho = 0.8, coef = rep(1, 10), snr = 3, p.train = 0.5, seed = 1010 ) # Split training and test set x <- dat$x.tr y <- dat$y.tr x.test <- dat$x.te y.test <- dat$y.te # Fit the OHPL model fit <- OHPL(x, y, maxcomp = 3, gamma = 0.5, G = 10, type = "max") # Selected variables fit$Vsel # Make predictions y.pred <- predict(fit, x.test) # Compute evaluation metric RMSEP, Q2 and MAE for the test set perf <- OHPL.RMSEP(fit, x.test, y.test) perf$RMSEP perf$Q2 perf$MAE
Makes predictions on new data and computes the performance evaluation metrics RMSEP, MAE, and Q2.
OHPL.RMSEP(object, newx, newy)
OHPL.RMSEP(object, newx, newy)
object |
Object of class |
newx |
Predictor matrix of the new data. |
newy |
Response matrix of the new data (matrix with one column). |
A list of the performance metrics.
# Generate simulation data dat <- OHPL.sim( n = 100, p = 100, rho = 0.8, coef = rep(1, 10), snr = 3, p.train = 0.5, seed = 1010 ) # Split training and test set x <- dat$x.tr y <- dat$y.tr x.test <- dat$x.te y.test <- dat$y.te # Fit the OHPL model fit <- OHPL(x, y, maxcomp = 3, gamma = 0.5, G = 10, type = "max") # Compute evaluation metric RMSEP, Q2 and MAE for the test set perf <- OHPL.RMSEP(fit, x.test, y.test) perf$RMSEP perf$Q2 perf$MAE
# Generate simulation data dat <- OHPL.sim( n = 100, p = 100, rho = 0.8, coef = rep(1, 10), snr = 3, p.train = 0.5, seed = 1010 ) # Split training and test set x <- dat$x.tr y <- dat$y.tr x.test <- dat$x.te y.test <- dat$y.te # Fit the OHPL model fit <- OHPL(x, y, maxcomp = 3, gamma = 0.5, G = 10, type = "max") # Compute evaluation metric RMSEP, Q2 and MAE for the test set perf <- OHPL.RMSEP(fit, x.test, y.test) perf$RMSEP perf$Q2 perf$MAE
Generate simulation data (Gaussian case) following the settings in Xiao and Xu (2015).
OHPL.sim( n = 100, p = 100, rho = 0.8, coef = rep(1, 10), snr = 3, p.train = 0.5, seed = 1001 )
OHPL.sim( n = 100, p = 100, rho = 0.8, coef = rep(1, 10), snr = 3, p.train = 0.5, seed = 1001 )
n |
Number of observations. |
p |
Number of variables. |
rho |
Correlation base for generating correlated variables. |
coef |
Vector of non-zero coefficients. |
snr |
Signal-to-noise ratio (SNR). |
p.train |
Percentage of training set. |
seed |
Random seed for reproducibility. |
A list containing x.tr
, x.te
, y.tr
, and y.te
.
Nan Xiao <\url{https://nanx.me}>
Nan Xiao and Qing-Song Xu. (2015). Multi-step adaptive elastic-net: reducing false positives in high-dimensional variable selection. Journal of Statistical Computation and Simulation 85(18), 3755–3765.
dat <- OHPL.sim( n = 100, p = 100, rho = 0.8, coef = rep(1, 10), snr = 3, p.train = 0.5, seed = 1010 ) dim(dat$x.tr) dim(dat$x.te)
dat <- OHPL.sim( n = 100, p = 100, rho = 0.8, coef = rep(1, 10), snr = 3, p.train = 0.5, seed = 1010 ) dim(dat$x.tr) dim(dat$x.te)
Make predictions on new data by an OHPL model object.
## S3 method for class 'OHPL' predict(object, newx, ncomp = NULL, type = "response", ...)
## S3 method for class 'OHPL' predict(object, newx, ncomp = NULL, type = "response", ...)
object |
Object of class |
newx |
Predictor matrix of the new data. |
ncomp |
Optimal number of components.
If is |
type |
Prediction type. |
... |
Additional parameters. |
Numeric matrix of the predicted values.
# generate simulation data dat <- OHPL.sim( n = 100, p = 100, rho = 0.8, coef = rep(1, 10), snr = 3, p.train = 0.5, seed = 1010 ) # split training and test set x <- dat$x.tr y <- dat$y.tr x.test <- dat$x.te y.test <- dat$y.te # fit the OHPL model fit <- OHPL(x, y, maxcomp = 3, gamma = 0.5, G = 10, type = "max") # make predictions y.pred <- predict(fit, x.test) y.pred
# generate simulation data dat <- OHPL.sim( n = 100, p = 100, rho = 0.8, coef = rep(1, 10), snr = 3, p.train = 0.5, seed = 1010 ) # split training and test set x <- dat$x.tr y <- dat$y.tr x.test <- dat$x.te y.test <- dat$y.te # fit the OHPL model fit <- OHPL(x, y, maxcomp = 3, gamma = 0.5, G = 10, type = "max") # make predictions y.pred <- predict(fit, x.test) y.pred
Extracts the prototype from each variable group.
proto(X, y, groups, type = c("max", "median"), mu = NULL)
proto(X, y, groups, type = c("max", "median"), mu = NULL)
X |
Predictor matrix. |
y |
Response matrix with one column. |
groups |
An group index vector containing the
group number each variable belongs to. For example:
|
type |
The rule for extracting the prototype.
Possible options are |
mu |
The mean value of |
The prototypes (variable index) extracted from each group (cluster).
The soil dataset contains 108 sample measurements from the wavelength range of 400–2500 nm (visible and near infrared spectrum) published by Rinnan et al.
data(soil)
data(soil)
Rinnan, R., & Rinnan, A. (2007). Application of near infrared reflectance (NIR) and fluorescence spectroscopy to analysis of microbiological and chemical properties of arctic soil. Soil biology and Biochemistry, 39(7), 1664–1673.
data("soil") X <- soil$x y <- soil$som n <- nrow(soil$x) set.seed(1001) samp.idx <- sample(1L:n, round(n * 0.7)) X.cal <- X[samp.idx, ] y.cal <- y[samp.idx] X.test <- X[-samp.idx, ] y.test <- y[-samp.idx]
data("soil") X <- soil$x y <- soil$som n <- nrow(soil$x) set.seed(1001) samp.idx <- sample(1L:n, round(n * 0.7)) X.cal <- X[samp.idx, ] y.cal <- y[samp.idx] X.test <- X[-samp.idx, ] y.test <- y[-samp.idx]
The wheat dataset contains 100 wheat samples with specified protein and moisture content, published by J. Kalivas. Samples were measured by diffuse reflectance as log (I/R) from 1100 to 2500 nm (701 data points) in 2 nm intervals.
data(wheat)
data(wheat)
Kalivas, J. H. (1997). Two data sets of near infrared spectra. Chemometrics and Intelligent Laboratory Systems, 37(2), 255–259.
data("wheat") X <- wheat$x y <- wheat$protein n <- nrow(wheat$x) set.seed(1001) samp.idx <- sample(1L:n, round(n * 0.7)) X.cal <- X[samp.idx, ] y.cal <- y[samp.idx] X.test <- X[-samp.idx, ] y.test <- y[-samp.idx]
data("wheat") X <- wheat$x y <- wheat$protein n <- nrow(wheat$x) set.seed(1001) samp.idx <- sample(1L:n, round(n * 0.7)) X.cal <- X[samp.idx, ] y.cal <- y[samp.idx] X.test <- X[-samp.idx, ] y.test <- y[-samp.idx]