Subject: Statistics
Paper: Multivariate Analysis
Module: Introduction to Principal Components Analysis

Development Team
Principal investigator: Dr. Bhaswati Ganguli, Professor, Department of Statistics, University of Calcutta
Paper co-ordinator: Dr. Sugata SenRoy, Professor, Department of Statistics, University of Calcutta
Content writer: Souvik Bandyopadhyay, Senior Lecturer, Indian Institute of Public Health, Hyderabad
Content reviewer: Dr. Kalyan Das, Professor, Department of Statistics, University of Calcutta

Dimension Reduction Techniques

Question: Is it necessary to look at all the variables, or is it possible for a smaller set of variables to capture the same information?
Some of the techniques used are:
- Principal Component Analysis
- Factor Analysis
- Canonical Correlations
- Multidimensional Scaling

We will discuss Principal Component Analysis in this lecture.
Introduction

Principal Component Analysis
A principal component analysis is concerned with explaining the variance-covariance structure of a set of variables through a few linear combinations of these variables.
- The technique helps to capture the variability of a set of variables through a few linear combinations of the variables.
- These linear combinations contain most of the information contained in the original set.

Objectives of principal component analysis
- Data reduction: to replace the original set of variables by a smaller set of p variables which retains a large proportion of the total variability of the original variables.
  - We then need to look at fewer variables rather than a long list, which makes the data easier to work with.
- Interpretation: to reveal relationships that were not previously suspected.
  - It becomes easier to capture the underlying mechanism of the model, and hence easier to interpret.
- Principal components can also serve as an intermediate step in much larger investigations such as cluster analysis or multiple regression.
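As a preview of how the data-reduction objective works out in practice, the sketch below is my own illustration (not from the lecture): it builds simulated data in which 8 observed variables are driven by 2 underlying factors, and keeps only as many principal components as are needed to retain 90% of the total variability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: 8 observed variables driven by 2 underlying factors
# (a hypothetical setup), so about 2 components should suffice.
loadings = np.array([[1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]])
factors = rng.normal(size=(200, 2))
X = factors @ loadings + 0.1 * rng.normal(size=(200, 8))
Xc = X - X.mean(axis=0)

# Eigenvalues of the sample covariance matrix, largest first.
eigval = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]

# Smallest p such that the first p components retain >= 90% of total variance.
cum = np.cumsum(eigval) / eigval.sum()
p = int(np.searchsorted(cum, 0.90) + 1)
print(p)  # -> 2
```

The method for constructing the components themselves is the subject of the rest of the lecture; here only the eigenvalues are needed, since (as shown later) they equal the component variances.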
A Simple Example

Let X1 = Height and X2 = Weight.
Q: Should we look at both of these variables?

New variables
Consider the linear combinations
  Y1 = a11 X1 + a12 X2
  Y2 = a21 X1 + a22 X2
Instead of looking at both X1 and X2, can we reasonably look at either Y1 or Y2 only, depending on which has the larger variability?

Note: We can only have two linearly independent combinations of X1 and X2.

Example (contd.)
- Here we have the plot of X2 against X1 (with bases changed).
- It shows a fair amount of variability in both X1 and X2.

Example (contd.)
- We now rotate the axes to Y1 and Y2 (in red).
- This shows that the major variability is captured by Y1.
- So we can discard Y2 and consider the single variable Y1 instead of X1 and X2.

Population Principal Components

Geometrical significance
Principal components represent the selection of a new coordinate system with coordinates Y1, Y2, ..., Ym obtained by rotating the original system with X1, X2, ..., Xm as the coordinate axes. The new axes represent the directions of maximum variability and provide a simpler, more parsimonious description of the covariance structure.

Population Principal Components
- Let X = [X1, X2, ..., Xm].
- Let Cov(X) = Σ.
- Let the eigenvalues of Σ be λ1 ≥ λ2 ≥ ... ≥ λm ≥ 0.
- Consider the linear combinations
    Y1 = a1'X = a11 X1 + a12 X2 + ... + a1m Xm
     ⋮
    Ym = am'X = am1 X1 + am2 X2 + ... + amm Xm
- Y1, Y2, ..., Ym, chosen orthogonal to each other and with decreasing variability, are the principal components.
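The height/weight rotation and the role of the eigenvalues can both be illustrated numerically. The sketch below is my own (simulated data loosely resembling the height/weight example, with hypothetical units, not from the lecture): it rotates the centred data onto the eigenvectors of the sample covariance matrix and checks that Var(Yj) equals λj, with Y1 carrying most of the variability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated correlated "height" (X1) and "weight" (X2) data -- hypothetical.
x1 = rng.normal(165.0, 10.0, size=300)
x2 = 0.9 * x1 + rng.normal(0.0, 5.0, size=300)
X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)

# Eigen-decomposition of the sample covariance matrix.
S = np.cov(Xc, rowvar=False)
lam, A = np.linalg.eigh(S)
lam, A = lam[::-1], A[:, ::-1]       # order so that lambda_1 >= lambda_2

# Rotate onto the new axes: columns of Y are Y1 and Y2.
Y = Xc @ A

# Var(Y_j) = lambda_j, and Y1 captures the bulk of the total variance.
print(np.allclose(Y.var(axis=0, ddof=1), lam))
print(lam[0] / lam.sum())            # share of total variance in Y1
```

With strongly correlated inputs like these, the first component typically accounts for well over 90% of the total variance, which is exactly the situation in which discarding Y2 is reasonable.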
How to choose the aj's
- Observe that
    Var(Yj) = aj'Σaj,        j = 1, 2, ..., m
    Cov(Yj, Yk) = aj'Σak,    j, k = 1, 2, ..., m
- The first principal component is the linear combination with maximum variance, i.e. it maximizes Var(Y1) = a1'Σa1.
- Since Var(Y1) can be increased by multiplying a1 by some constant, it is convenient to restrict attention to coefficient vectors of unit length only.

Choosing the Yj's
- The coefficient a1 of the first principal component Y1 = a1'X is chosen to maximize Var(a1'X) subject to the condition a1'a1 = 1.
- The coefficient a2 of the second principal component Y2 = a2'X is chosen to maximize Var(a2'X) subject to a2'a2 = 1 and a1'a2 = 0.
- In general, for j = 1, ..., m, the coefficient aj of the j-th principal component Yj = aj'X is chosen to maximize Var(aj'X) subject to aj'aj = 1 and ak'aj = 0 for all k < j.

Advantages

Result
  Var(X1) + ... + Var(Xm) = Var(Y1) + ... + Var(Ym)

The advantages are:
- The Y-variances are in decreasing order, i.e. Var(Y1) ≥ Var(Y2) ≥ ... ≥ Var(Ym).
- Unlike Cov(Xj, Xk), we have Cov(Yj, Yk) = 0 for all j ≠ k.

Strategy
Work with only p << m principal components Y instead of all of X if these explain a large portion of X's variability (say 80% or 90%).

Example
- Let us illustrate the idea of principal components through an example.
- Consider the amounts of protein (X1), carbohydrate (X2), fat (X3), calorie (X4) and vitamin (X5) contents of 12 selected cereal brands (data source: Johnson & Wichern, 2002).
- The purpose is to see if, instead of these 5 variables, we can observe just one or two constructed variables (principal components) which will synthesize the data without losing much of its variability.
- The method by which we construct these principal components will be discussed in later lectures.
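Before looking at the results, the defining properties stated above, unit-length coefficient vectors, the maximal-variance property of a1, zero correlation between components, and preservation of total variance, can all be verified numerically. This is my own sketch using an arbitrary positive-definite covariance matrix, not the cereal data:

```python
import numpy as np

rng = np.random.default_rng(1)

# An arbitrary positive-definite covariance matrix Sigma (illustrative only).
M = rng.normal(size=(4, 4))
Sigma = M @ M.T + 0.5 * np.eye(4)

lam, A = np.linalg.eigh(Sigma)
lam, A = lam[::-1], A[:, ::-1]        # decreasing eigenvalues; columns a_1..a_m

# a_j' a_j = 1 and a_k' a_j = 0 for k != j (orthonormal coefficient vectors).
print(np.allclose(A.T @ A, np.eye(4)))

# No random unit vector achieves a larger variance than a_1.
best = A[:, 0] @ Sigma @ A[:, 0]      # equals lambda_1
for _ in range(2000):
    a = rng.normal(size=4)
    a /= np.linalg.norm(a)            # restrict to unit length
    assert a @ Sigma @ a <= best + 1e-9

# Cov(Y_j, Y_k) = a_j' Sigma a_k = 0 for j != k, and the total variance
# trace(Sigma) = Var(X1)+...+Var(Xm) equals sum of the lambda_j.
C = A.T @ Sigma @ A
print(np.allclose(C, np.diag(lam)))
print(np.isclose(np.trace(Sigma), lam.sum()))
```

The random-direction loop is only an empirical check, not a proof; the lecture's constrained maximization argument is what guarantees that the eigenvector of the largest eigenvalue is the true maximizer.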
Example (contd.)
The 5 principal components come out as follows:
  Y1 =  0 X1 + 0 X2 + 0 X3 + 0.228 X4 + 0.972 X5
  Y2 =  0 X1 − 0.176 X2 + 0 X3 − 0.956 X4 + 0.233 X5
  Y3 =  0.561 X1 − 0.814 X2 + 0.216 X3 + 0 X4 + 0 X5
  Y4 = −0.816 X1 − 0.521 X2 + 0.973 X3 + 0.129 X4 + 0 X5
  Y5 =  0.136 X1 + 0.179 X2 + 0 X3 + 0.126 X4 + 0 X5

Example (contd.)
We next look at the standard deviations, and the proportion and cumulative proportion of variability explained by the Yj's.

                       Y1      Y2      Y3      Y4      Y5
  s.d.                 31.82   16.17   2.43    0.51    0.40
  prop. of Var         0.79    0.20    0.007   0.002   0.001
  cum. prop. of Var    0.79    0.99    0.997   0.999   1.0

- The first principal component explains 79% of the variability in the X's.
- If we want to capture more of the variability of the original variables, the first and second principal components together account for 99% of the total X-variability.
- Thus, instead of the 5 X variables, we can work with only the 2 principal components Y1 and Y2.

Summary
- The different data reduction techniques are stated.
- The idea of principal component analysis is explained.
- The derivation of principal components is shown.
- The technique is illustrated through an example.
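As a closing arithmetic check on the cereal example: each proportion in the table is the component's variance (s.d. squared) divided by the total variance, and the cumulative row is the running sum. Reproducing this from the quoted standard deviations gives slightly different third decimals than the table (presumably because the quoted s.d.s are themselves rounded):

```python
# Standard deviations of Y1..Y5 as quoted in the cereal example table.
sd = [31.82, 16.17, 2.43, 0.51, 0.40]

var = [s ** 2 for s in sd]           # variance = s.d. squared
total = sum(var)

prop = [v / total for v in var]      # proportion of total variance
cum = [sum(prop[:j + 1]) for j in range(len(prop))]

print([round(x, 3) for x in prop])   # -> [0.791, 0.204, 0.005, 0.0, 0.0]
print([round(x, 3) for x in cum])    # -> [0.791, 0.995, 1.0, 1.0, 1.0]
```

Either way, the conclusion of the lecture stands: the first two components account for about 99% of the total variability.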