Principal Component Analysis

Parinya Sanguansat
2012
Principal component analysis (PCA) is a multivariate technique that analyzes a data table in which observations are described by several inter-correlated quantitative dependent variables. Its goal is to extract the important information from the table, to represent it as a set of new orthogonal variables called principal components, and to display the pattern of similarity of the observations and of the variables as points on maps. The quality of the PCA model can be evaluated using cross-validation techniques such as the bootstrap and the jackknife. PCA can be generalized as correspondence analysis (CA) in order to handle qualitative variables, and as multiple factor analysis (MFA) in order to handle heterogeneous sets of variables. Mathematically, PCA depends upon the eigen-decomposition of positive semi-definite matrices and upon the singular value decomposition (SVD) of rectangular matrices.

Send correspondence to Hervé Abdi: herve@utdallas.edu, www.utdallas.edu/∼herve. We would like to thank Yoshio Takane for his very helpful comments on a draft of this paper.

Prerequisite notions and notations

Matrices are denoted in upper case bold, vectors are denoted in lower case bold, and elements are denoted in lower case italic. Matrices, vectors, and elements from the same matrix all use the same letter (e.g., A, a, a). The transpose operation is denoted by the superscript T. The identity matrix is denoted I.

The data table to be analyzed by PCA comprises I observations described by J variables; it is represented by the I × J matrix X, whose generic element is x_{i,j}. The matrix X has rank L, where L ≤ min{I, J}. In general, the data table will be pre-processed before the analysis. Almost always, the columns of X will be centered so that the mean of each column is equal to 0 (i.e., X^T 1 = 0, where 0 is a J × 1 vector of zeros and 1 is an I × 1 vector of ones). If, in addition, each element of X is divided by √I (or √(I − 1)), the analysis is referred to as a covariance PCA because, in this case, the matrix X^T X is a covariance matrix. In addition to centering, when the variables are measured in different units, it is customary to standardize each variable to unit norm. This is obtained by dividing each variable by its norm (i.e., the square root of the sum of all the squared elements of this variable).
In this case, the analysis is referred to as a correlation PCA because, then, the matrix X^T X is a correlation matrix (most statistical packages use correlation preprocessing as a default).
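As a minimal sketch of the pre-processing choices described above (using NumPy; the variable names and the random example data are mine, not the paper's), the following shows that centering and dividing by √(I − 1) makes X^T X a covariance matrix, that additionally scaling each column to unit norm makes it a correlation matrix, and that the principal components then follow from the SVD of the pre-processed matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))  # I = 10 observations, J = 4 variables (example data)
I = X.shape[0]

# Center the columns so that X^T 1 = 0.
Xc = X - X.mean(axis=0)

# Covariance PCA: divide by sqrt(I - 1) so that X^T X is the covariance matrix.
X_cov = Xc / np.sqrt(I - 1)
assert np.allclose(X_cov.T @ X_cov, np.cov(X, rowvar=False))

# Correlation PCA: scale each centered column to unit norm,
# so that X^T X is the correlation matrix.
X_cor = Xc / np.linalg.norm(Xc, axis=0)
assert np.allclose(X_cor.T @ X_cor, np.corrcoef(X, rowvar=False))

# The principal components come from the SVD of the pre-processed matrix:
# the columns of V hold the loadings, and U * S gives the factor scores.
U, S, Vt = np.linalg.svd(X_cov, full_matrices=False)
scores = U * S
```

Note that the eigen-decomposition of X^T X and the SVD of X give the same components: the squared singular values S² are the eigenvalues of X^T X, which is why the two decompositions are interchangeable here.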