SelvarMix: An R package for variable selection in model-based clustering and discriminant analysis with a regularization approach.



All the experiments were run with SelvarMix 1.2.


The SelvarMix package (Celeux, Maugis-Rabusseau, and Sedki 2018) implements a regularization approach to variable selection in the model-based clustering and classification frameworks. First, the variables are ranked with a lasso-like procedure. Second, the method of (Maugis, Celeux, and Martin-Magniette 2009; Maugis, Celeux, and Martin-Magniette 2011) is adapted to define the role of the variables in the two frameworks. The variable ranking makes it possible to avoid the slow forward or backward stepwise algorithms of (Maugis, Celeux, and Martin-Magniette 2009). SelvarMix therefore provides a much faster variable selection procedure than (Maugis, Celeux, and Martin-Magniette 2009; Maugis, Celeux, and Martin-Magniette 2011), which makes the study of high-dimensional datasets feasible.
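To give an intuition for what "ranking variables with a lasso-like procedure" can mean, here is a toy sketch in base R. It is not the SelvarMix algorithm: the cluster labels are assumed known, the penalty is a plain soft-threshold applied to standardized cluster means, and the penalty grid `lambdas` is a hypothetical choice. A variable whose cluster means survive shrinkage over more of the grid is ranked higher.

```r
# Toy illustration only: labels are assumed known and the "penalty" is a plain
# soft-threshold on standardized cluster means, not the SelvarMix procedure.
set.seed(42)                          # arbitrary seed for reproducibility
n <- 400; p <- 6
z <- sample(1:2, n, replace = TRUE)
x <- matrix(rnorm(n * p), n, p)
x[z == 2, 1] <- x[z == 2, 1] + 3      # variable 1: strong cluster signal
x[z == 2, 2] <- x[z == 2, 2] + 1      # variable 2: weaker cluster signal

xs <- scale(x)                        # centre and scale each variable
# 2 x p matrix of per-cluster means of the standardized variables
mu <- apply(xs, 2, function(col) tapply(col, z, mean))

soft_threshold <- function(m, lambda) sign(m) * pmax(abs(m) - lambda, 0)

lambdas <- seq(0.05, 1, by = 0.05)    # hypothetical penalty grid
# Score of variable j = number of grid values for which at least one
# shrunken cluster mean is still non-zero
score <- sapply(seq_len(p), function(j) {
  sum(sapply(lambdas, function(l) any(soft_threshold(mu[, j], l) != 0)))
})
ranking <- order(score, decreasing = TRUE)
print(score)
print(ranking)
```

In this toy setting the two informative variables come first in `ranking`, and the pure-noise variables drop out of the model after the first few grid values.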

The utility functions summary and print facilitate the interpretation of the results.

Overview of the SelvarMix functions

This section presents the complete analysis of a simulated data set. It makes use of all the functions implemented in the SelvarMix package and may be regarded as a tutorial.

The cluster analysis is performed with an unknown number of clusters. An information criterion is used for variable selection and choosing the number of clusters. The chosen model is described in a summary.

The synthetic dataset

The simulated dataset consists of 2000 data points in \(\mathbb{R}^{14}\). On the subset of relevant clustering variables \(S = \{1, 2\}\), the data are distributed according to a mixture of four equiprobable spherical Gaussian distributions with means \((0,0)\), \((3,0)\), \((0,3)\) and \((3,3)\). The subset of redundant variables is \(U = \{3, \ldots, 11\}\). These variables are explained by the subset of predictor variables \(R = \{1,2\}\) through a linear regression. The last three variables \(W = \{12, 13, 14\}\) are independent of all the other variables. More details are given in (Maugis, Celeux, and Martin-Magniette 2009).
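The description above can be summarized as a generative model. The notation is an assumption made for this summary: \(z\) is the component label, \(\mu_k\) the component means, \(a\) the intercept vector, \(b\) the \(2 \times 9\) matrix of regression coefficients, \(\Omega\) the \(9 \times 9\) covariance matrix of the regression noise, and \(m_W\) the mean of the independent variables.

\[
\begin{aligned}
z &\sim \mathcal{U}\{1, 2, 3, 4\}, \\
x^{S} \mid z = k &\sim \mathcal{N}(\mu_k, I_2), \qquad \mu_1 = (0,0),\ \mu_2 = (3,0),\ \mu_3 = (0,3),\ \mu_4 = (3,3), \\
x^{U} &= a + x^{R}\, b + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \Omega), \\
x^{W} &\sim \mathcal{N}(m_W, I_3), \quad \text{independently of } (x^{S}, x^{U}).
\end{aligned}
\]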

n <- 2000; p <- 14
x <- matrix(0,n, p)
x[,1] <- rnorm(n,0,1)
x[,2] <- rnorm(n,0,1)
z <- sample(1:4, n, replace = TRUE)
x[z==2, 1] <- x[z==2, 1] + 3
x[z==3, 2] <- x[z==3, 2] + 3
x[z==4, 1] <- x[z==4, 1] + 3
x[z==4, 2] <- x[z==4, 2] + 3

omega <- matrix(0, 9, 9); diag(omega)[1:3] <- rep(1,3); diag(omega)[4:5] <- rep(0.5,2)
rtmat1 <- matrix(c(cos(pi/3), -sin(pi/3), sin(pi/3), cos(pi/3)), ncol = 2, byrow = TRUE)
rtmat2 <- matrix(c(cos(pi/6), -sin(pi/6), sin(pi/6), cos(pi/6)), ncol = 2, byrow = TRUE)
omega[6:7, 6:7] <- t(rtmat1) %*% diag(c(1,3)) %*% rtmat1
omega[8:9, 8:9] <- t(rtmat2) %*% diag(c(2,6)) %*% rtmat2
b <- cbind(c(0.5,1), c(2,0), c(0,3), c(-1,2), c(2,-4), c(0.5,0), c(4,0.5), c(3,0), c(2,1))
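# Repeat the intercepts row-wise so that column j of x[,3:11] receives intercept a[j]
# (adding the bare vector would recycle it down the columns instead)
a <- c(0, 0, seq(0.4, 2, length.out = 7))
x[,3:11] <- matrix(a, n, 9, byrow = TRUE) + x[,1:2] %*% b + t(t(chol(omega)) %*% matrix(rnorm(n*9), 9, n))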
x[,12:14] <- matrix(rnorm(3*n), n, 3)
x[,12] <- x[,12] + 3.2; x[,13] <- x[,13] + 3.6; x[,14] <- x[,14] + 4
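Before running the selection, it can help to check that the clustering structure on \(S\) was simulated as intended. The sketch below is self-contained: it redraws only the two clustering variables (the names `shift`, `xs` and `emp_means` and the seed are choices made for this example) and compares the empirical component means and proportions with the design values.

```r
set.seed(1)                           # arbitrary seed, only for reproducibility
n <- 2000
z <- sample(1:4, n, replace = TRUE)   # equiprobable component labels

# Component means on S = {1, 2}, one row per component
shift <- rbind(c(0, 0), c(3, 0), c(0, 3), c(3, 3))
xs <- matrix(rnorm(2 * n), n, 2) + shift[z, ]

# Empirical component means: 4 x 2 matrix, one row per component
emp_means <- apply(xs, 2, function(col) tapply(col, z, mean))
print(round(emp_means, 2))

# Empirical mixing proportions, expected to be close to 0.25 each
print(round(prop.table(table(z)), 2))
```

The same two summaries can be computed on the first two columns of the full matrix `x` generated above.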

Variable selection and selection of the number of clusters in the clustering framework

library(SelvarMix)
library(Rmixmod)  # provides mixmodGaussianModel()

# Cluster analysis with variable selection, using parallel computing (8 cores)
# The last two input arguments (models and nbcores) are optional
obj <- SelvarClustLasso(x=x, nbcluster=3:5, models=mixmodGaussianModel(family = "spherical"), nbcores=8)
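The call above compares candidate numbers of clusters and candidate variable roles through an information criterion. As a sketch, assuming the criterion is BIC and following the variable-role decomposition of (Maugis, Celeux, and Martin-Magniette 2009), the quantity to maximize over the number of clusters \(K\), the mixture form \(m\) and the partition \((S, R, U, W)\) can be written as follows. The notation is simplified for illustration and omits the covariance structure of the regression.

\[
\mathrm{crit}(K, m, S, R, U, W) = \mathrm{BIC}_{\mathrm{clust}}\left(x^{S} \mid K, m\right) + \mathrm{BIC}_{\mathrm{reg}}\left(x^{U} \mid x^{R}\right) + \mathrm{BIC}_{\mathrm{indep}}\left(x^{W}\right),
\qquad
\mathrm{BIC} = 2 \log \hat{L} - \nu \log n,
\]

where \(\hat{L}\) is the maximized likelihood of the corresponding sub-model, \(\nu\) its number of free parameters and \(n\) the sample size.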

Model Summary

# Summary of the selected model
summary(obj)

Result print

# Print the clustering and regression parameters
print(obj)
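Because the data are simulated, the true labels `z` are available, so the estimated partition can be compared with them. A minimal sketch of the adjusted Rand index in base R follows. The function name `adjusted_rand_index` is introduced here for illustration, and the element name `partition` in the commented usage line is an assumption to check against the package documentation.

```r
# Adjusted Rand index between two partitions of the same observations:
# 1 for identical partitions (up to relabelling), about 0 for random agreement
adjusted_rand_index <- function(a, b) {
  tab <- table(a, b)                          # contingency table of the two partitions
  n <- sum(tab)
  sum_cells <- sum(choose(tab, 2))            # pairs grouped together in both partitions
  sum_rows  <- sum(choose(rowSums(tab), 2))   # pairs grouped together in a
  sum_cols  <- sum(choose(colSums(tab), 2))   # pairs grouped together in b
  expected  <- sum_rows * sum_cols / choose(n, 2)
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

# Self-contained demo: the same partition with permuted labels
truth <- c(1, 1, 2, 2, 3, 3)
est   <- c(2, 2, 3, 3, 1, 1)
ari_same <- adjusted_rand_index(truth, est)
print(ari_same)

# Hypothetical usage with the objects above (element name is an assumption):
# adjusted_rand_index(z, obj$partition)
```

The index is invariant to label switching, which matters here because the cluster numbering returned by a mixture fit is arbitrary.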

Variable selection in classification

# Discriminant analysis with learning and testing data
# Variable selection with parallel computing (8 cores)
xl <- x[1:1900,]; xt <- x[1901:2000,] 
zl <- z[1:1900]; zt <- z[1901:2000]
obj <- SelvarLearnLasso(x=xl, z=zl, models=mixmodGaussianModel(family = "spherical"), xtest=xt, ztest=zt, nbcores=8)

Model Summary

# Summary of the selected model
summary(obj)

Result print

# Print the classification and regression parameters
print(obj)

Celeux, Gilles, Cathy Maugis-Rabusseau, and Mohammed Sedki. 2018. “Variable Selection in Model-Based Clustering and Discriminant Analysis with a Regularization Approach.” Advances in Data Analysis and Classification, April. doi:10.1007/s11634-018-0322-5.

Maugis, C., G. Celeux, and M.-L. Martin-Magniette. 2009. “Variable Selection in Model-Based Clustering: A General Variable Role Modeling.” Computational Statistics and Data Analysis 53: 3872–82.

Maugis, C., G. Celeux, and M.-L. Martin-Magniette. 2011. “Variable Selection in Model-Based Discriminant Analysis.” Journal of Multivariate Analysis 102: 1374–87.