SelvarMix

SelvarMix: An R package for variable selection in model-based clustering and discriminant analysis with a regularization approach.

Overview of the SelvarMix functions

This section presents a complete analysis of a simulated data set. It uses all the functions implemented in the SelvarMix package and may be regarded as a tutorial.

The cluster analysis is performed with an unknown number of clusters. An information criterion is used both for variable selection and for choosing the number of clusters. The selected model is then described in a summary.

The synthetic dataset

The simulated dataset consists of 2000 data points in \(\mathbb{R}^{14}\). On the subset of relevant clustering variables \(S = \{1, 2\}\), the data are distributed according to a mixture of four equiprobable spherical Gaussian distributions with means \((0,0)\), \((3,0)\), \((0,3)\) and \((3,3)\). The subset of redundant variables is \(U = \{3, \dots, 11\}\). These variables are explained by the subset of predictor variables \(R = \{1, 2\}\) through a linear regression. The last three variables, \(W = \{12, 13, 14\}\), are independent of all the other variables. More details are given in (Maugis, Celeux, and Martin-Magniette 2009).
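
As a compact restatement of the description above, the generative model can be written as follows. The symbols \(z\), \(\mu_k\), \(a\), \(\varepsilon\) and \(\mu_W\) are notation introduced here for illustration; \(b\) and \(\Omega\) correspond to the objects `b` and `omega` built in the simulation code.

\[
x^{S} \mid z = k \sim \mathcal{N}(\mu_k, I_2), \qquad P(z = k) = \tfrac{1}{4}, \qquad \mu_1 = (0,0),\ \mu_2 = (3,0),\ \mu_3 = (0,3),\ \mu_4 = (3,3)
\]

\[
x^{U} = a + x^{R}\, b + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \Omega)
\]

\[
x^{W} \sim \mathcal{N}(\mu_W, I_3)
\]

Here \(a\) is a vector of nine intercepts, \(b\) is a \(2 \times 9\) matrix of regression coefficients, \(\Omega\) is a \(9 \times 9\) covariance matrix, and \(x^{W}\) is drawn independently of \(x^{S}\) and \(x^{U}\).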

require(SelvarMix)
set.seed(123)
n <- 2000; p <- 14
x <- matrix(0,n, p)
x[,1] <- rnorm(n,0,1)
x[,2] <- rnorm(n,0,1)
z <- sample(1:4, n, replace = TRUE)
x[z==2, 1] <- x[z==2, 1] + 3
x[z==3, 2] <- x[z==3, 2] + 3
x[z==4, 1] <- x[z==4, 1] + 3
x[z==4, 2] <- x[z==4, 2] + 3

omega <- matrix(0, 9, 9); diag(omega)[1:3] <- rep(1,3); diag(omega)[4:5] <- rep(0.5,2)
rtmat1 <- matrix(c(cos(pi/3), -sin(pi/3), sin(pi/3), cos(pi/3)), ncol = 2, byrow = TRUE)
rtmat2 <- matrix(c(cos(pi/6), -sin(pi/6), sin(pi/6), cos(pi/6)), ncol = 2, byrow = TRUE)
omega[6:7, 6:7] <- t(rtmat1) %*% diag(c(1,3)) %*% rtmat1
omega[8:9, 8:9] <- t(rtmat2) %*% diag(c(2,6)) %*% rtmat2
b <- cbind(c(0.5,1), c(2,0), c(0,3), c(-1,2), c(2,-4), c(0.5,0), c(4,0.5), c(3,0), c(2,1))
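# One intercept per redundant variable; byrow = TRUE repeats the vector on every row
intercept <- matrix(c(0, 0, seq(0.4, 2, len=7)), n, 9, byrow = TRUE)
x[,3:11] <- intercept + x[,1:2]%*%b + t(t(chol(omega)) %*% matrix(rnorm(n*9), 9, n))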
x[,12:14] <- matrix(rnorm(3*n), n, 3)
x[,12] <- x[,12] + 3.2; x[,13] <- x[,13] + 3.6; x[,14] <- x[,14] + 4
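
The noise term of the regression in the simulation is generated with the expression `t(t(chol(omega)) %*% ...)`. The following minimal sketch illustrates why that works, using a small toy covariance matrix (`sigma`, `n_draws`, `z_std` and `eps` are names chosen here for illustration, not objects from the tutorial): `chol()` returns an upper-triangular factor, so multiplying standard normal draws by its transpose gives draws whose covariance is approximately the target matrix.

```r
# Sketch: correlated Gaussian noise from a Cholesky factor (base R only).
set.seed(1)
sigma <- matrix(c(2.0, 0.5,
                  0.5, 1.0), nrow = 2, byrow = TRUE)  # toy covariance, not the tutorial's omega
n_draws <- 20000

u <- chol(sigma)                                   # upper triangular, t(u) %*% u equals sigma
z_std <- matrix(rnorm(2 * n_draws), nrow = 2, ncol = n_draws)  # standard normal columns
eps <- t(t(u) %*% z_std)                           # n_draws x 2 matrix, one draw per row

emp_cov <- cov(eps)                                # empirical covariance of the draws
print(round(emp_cov, 2))                           # should be close to sigma
```

The same construction with `omega` in place of `sigma` yields the nine-dimensional noise used for variables 3 to 11.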


Variable selection and selection of the number of clusters in the clustering framework

# Cluster analysis with variable selection with parallel computing (8 cores) 
# The last two input arguments are optional
require(SelvarMix)
obj <- SelvarClustLasso(x=x, nbcluster=3:5, models=mixmodGaussianModel(family = "spherical"), nbcores=8)
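
The call above hard-codes 8 cores. As a hedged alternative, the core count can be derived from the machine running the code. The sketch below uses `parallel::detectCores()` from base R and keeps one core free; the "minus one" rule is a common convention assumed here, not a SelvarMix recommendation.

```r
# Sketch: pick a core count from the machine instead of hard-coding 8.
detected <- parallel::detectCores()      # may return NA on some platforms
nbcores <- if (is.na(detected)) 1L else max(1L, detected - 1L)
print(nbcores)

# The value can then be passed to the call shown above, for example:
# obj <- SelvarClustLasso(x = x, nbcluster = 3:5,
#                         models = mixmodGaussianModel(family = "spherical"),
#                         nbcores = nbcores)
```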

Model Summary

# Summary of the selected model
summary(obj)


Result print

# print clustering and regression parameters 
print(obj)
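
Because the data are simulated, the true labels `z` are known, so an estimated partition can be compared with them. The sketch below is a base R illustration under stated assumptions: `part` is a hypothetical vector of estimated labels built by hand (how to extract it from `obj` is not assumed here), and `adjusted_rand()` is a helper written for this example, not a SelvarMix function. It computes the adjusted Rand index, which does not depend on how the clusters are numbered.

```r
# Sketch: compare an estimated partition with known labels (base R only).
adjusted_rand <- function(a, b) {
  tab <- table(a, b)                           # contingency table of the two labelings
  sum_ij <- sum(choose(tab, 2))                # pairs grouped together in both
  sum_a <- sum(choose(rowSums(tab), 2))        # pairs grouped together in a
  sum_b <- sum(choose(colSums(tab), 2))        # pairs grouped together in b
  expected <- sum_a * sum_b / choose(sum(tab), 2)
  max_index <- (sum_a + sum_b) / 2
  (sum_ij - expected) / (max_index - expected)
}

set.seed(123)
z_true <- sample(1:4, 200, replace = TRUE)     # stand-in for the simulated labels
part <- c(2, 3, 4, 1)[z_true]                  # same grouping with permuted label names
flip <- sample(200, 10)
part[flip] <- sample(1:4, 10, replace = TRUE)  # perturb a few labels (hypothetical errors)

print(table(z_true, part))                     # confusion table
print(adjusted_rand(z_true, part))             # equals 1 only for identical groupings
```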


Variable selection in classification

# Discriminant analysis with learning and testing data
# Variable selection with parallel computing (8 cores)
xl <- x[1:1900,]; xt <- x[1901:2000,] 
zl <- z[1:1900]; zt <- z[1901:2000]
obj <- SelvarLearnLasso(x=xl, z=zl, models=mixmodGaussianModel(family = "spherical"), xtest=xt, ztest=zt,nbcores=8)
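
The split above takes the first 1900 rows for learning and the last 100 for testing, which is harmless here because the labels were drawn independently of the row order. For data whose rows are ordered, a random split may be preferable. A minimal sketch, with `test_idx` and `train_idx` as illustrative names:

```r
# Sketch: random learning/testing split of 2000 rows (base R only).
set.seed(123)
n <- 2000
test_idx <- sort(sample(n, 100))              # 100 rows held out for testing
train_idx <- setdiff(seq_len(n), test_idx)    # remaining 1900 rows for learning

print(length(train_idx))
print(length(test_idx))

# With the tutorial's objects, the split would then read:
# xl <- x[train_idx, ]; zl <- z[train_idx]
# xt <- x[test_idx, ];  zt <- z[test_idx]
```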

Model Summary

# Summary of the selected model
summary(obj)


Result print

# Print discriminant analysis and regression parameters
print(obj)


Celeux, Gilles, Cathy Maugis-Rabusseau, and Mohammed Sedki. 2018. “Variable Selection in Model-Based Clustering and Discriminant Analysis with a Regularization Approach.” Advances in Data Analysis and Classification, April. doi:10.1007/s11634-018-0322-5.

Maugis, C., G. Celeux, and M.-L. Martin-Magniette. 2009. “Variable Selection in Model-Based Clustering: A General Variable Role Modeling.” Computational Statistics and Data Analysis 53: 3872–82.

Maugis, C., G. Celeux, and M.-L. Martin-Magniette. 2011. “Variable Selection in Model-Based Discriminant Analysis.” Journal of Multivariate Analysis 102: 1374–87.
