TY - JOUR

T1 - Bayesian variable selection for globally sparse probabilistic PCA

AU - Bouveyron, Charles

AU - Latouche, Pierre

AU - Mattei, Pierre-Alexandre

PY - 2018

Y1 - 2018

N2 - Sparse versions of principal component analysis (PCA) have imposed themselves as simple, yet powerful ways of selecting relevant features of high-dimensional data in an unsupervised manner. However, when several sparse principal components are computed, the interpretation of the selected variables may be difficult since each axis has its own sparsity pattern and has to be interpreted separately. To overcome this drawback, we propose a Bayesian procedure that allows to obtain several sparse components with the same sparsity pattern. This allows the practitioner to identify which original variables are most relevant to describe the data. To this end, using Roweis’ probabilistic interpretation of PCA and an isotropic Gaussian prior on the loading matrix, we provide the first exact computation of the marginal likelihood of a Bayesian PCA model. Moreover, in order to avoid the drawbacks of discrete model selection, a simple relaxation of this framework is presented. It allows to find a path of candidate models using a variational expectation-maximization algorithm. The exact marginal likelihood can eventually be maximized over this path, relying on Occam’s razor to select the relevant variables. Since the sparsity pattern is common to all components, we call this approach globally sparse probabilistic PCA (GSPPCA). Its usefulness is illustrated on synthetic data sets and on several real unsupervised feature selection problems coming from signal processing and genomics. In particular, using unlabeled microarray data, GSPPCA is shown to infer biologically relevant subsets of genes. According to a metric based on pathway enrichment, it vastly surpasses in this context the performance of traditional sparse PCA algorithms. An R implementation of the GSPPCA algorithm is available at http://github.com/pamattei/GSPPCA.

AB - Sparse versions of principal component analysis (PCA) have imposed themselves as simple, yet powerful ways of selecting relevant features of high-dimensional data in an unsupervised manner. However, when several sparse principal components are computed, the interpretation of the selected variables may be difficult since each axis has its own sparsity pattern and has to be interpreted separately. To overcome this drawback, we propose a Bayesian procedure that allows to obtain several sparse components with the same sparsity pattern. This allows the practitioner to identify which original variables are most relevant to describe the data. To this end, using Roweis’ probabilistic interpretation of PCA and an isotropic Gaussian prior on the loading matrix, we provide the first exact computation of the marginal likelihood of a Bayesian PCA model. Moreover, in order to avoid the drawbacks of discrete model selection, a simple relaxation of this framework is presented. It allows to find a path of candidate models using a variational expectation-maximization algorithm. The exact marginal likelihood can eventually be maximized over this path, relying on Occam’s razor to select the relevant variables. Since the sparsity pattern is common to all components, we call this approach globally sparse probabilistic PCA (GSPPCA). Its usefulness is illustrated on synthetic data sets and on several real unsupervised feature selection problems coming from signal processing and genomics. In particular, using unlabeled microarray data, GSPPCA is shown to infer biologically relevant subsets of genes. According to a metric based on pathway enrichment, it vastly surpasses in this context the performance of traditional sparse PCA algorithms. An R implementation of the GSPPCA algorithm is available at http://github.com/pamattei/GSPPCA.

KW - Principal components

KW - Sparsity

KW - Unsupervised learning

KW - variable selection

U2 - 10.1214/18-EJS1450

DO - 10.1214/18-EJS1450

M3 - Journal article

VL - 12

SP - 3036

EP - 3070

JO - Electronic Journal of Statistics

JF - Electronic Journal of Statistics

SN - 1935-7524

ER -