Here’s a collection of interview questions frequently asked on Principal Component Analysis (PCA).
1. Explain the Curse of Dimensionality.
The curse of dimensionality refers to the problems that arise when working with high-dimensional data. As the number of features grows, the number of samples needed to cover the feature space grows even faster, and the model becomes more complex. The more features there are, the higher the chance of overfitting: a model trained on a large number of features becomes increasingly dependent on the data it was trained on, and in turn overfitted, resulting in poor performance on real data and defeating its purpose. The fewer features our training data has, the fewer assumptions our model makes and the simpler it will be.
2. Why do we need dimensionality reduction? What are its drawbacks?
In Machine Learning, dimension refers to the number of features. Dimensionality reduction is simply the process of reducing the number of features in your feature set.
Advantages of Dimensionality Reduction:
- Less misleading data means model accuracy improves.
- Fewer dimensions mean less computing. Less data means that algorithms train faster.
- Less data means less storage space required.
- Removes redundant features and noise.
- Dimensionality Reduction helps us visualize the data on 2D plots or 3D plots.
Drawbacks of Dimensionality Reduction:
- Some information is lost, possibly degrading the performance of subsequent training algorithms.
- It can be computationally intensive.
- Transformed features are often hard to interpret, which makes the resulting model less explainable.
3. Explain Principal Component Analysis, its assumptions, and its equations.
Principal Component Analysis (PCA) is an unsupervised dimensionality reduction algorithm. It identifies the hyperplane that lies closest to the data, and then projects the data onto it, preserving as much variance as possible.
Imagine projecting a two-dimensional dataset onto several candidate axes: the projection onto one axis preserves the maximum variance, the projection onto another preserves very little, and the remaining projections preserve intermediate amounts. PCA selects the axis that preserves the maximum variance in the training set, then finds a second axis, orthogonal to the first, that preserves the largest remaining variance, and so on, up to as many axes as there are dimensions in the data.
The algorithm proceeds in three steps:
1. Center the data and compute the covariance matrix.
2. Decompose the covariance matrix to find its eigenvalues and corresponding eigenvectors.
3. Arrange the eigenvalues in descending order: λ1 ≥ λ2 ≥ … ≥ λd. Each eigenvalue corresponds to a principal component; the largest eigenvalue corresponds to the 1st principal component, and so on.
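The three steps above can be sketched with NumPy; the data below is synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 3 correlated features (illustrative only).
mix = np.array([[3.0, 1.0, 0.0],
                [1.0, 2.0, 0.5],
                [0.0, 0.5, 0.3]])
X = rng.normal(size=(200, 3)) @ mix

# Step 1: center the data and compute the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Step 2: eigendecompose the (symmetric) covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 3: sort eigenvalues (and their eigenvectors) in descending order.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The top-k eigenvectors span the projection; here k = 2.
X_reduced = Xc @ eigvecs[:, :2]
explained = eigvals / eigvals.sum()   # variance ratio per principal component
```

Sorting is what ties the eigenvalues to the component order: the eigenvector paired with the largest eigenvalue becomes the 1st principal component.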
Assumptions of PCA:
- There should be a linear relationship between the variables, since PCA is based on Pearson correlation coefficients, which capture only linear relationships.
- You should have sampling adequacy: for PCA to produce a reliable result, a large enough sample size is required.
- Your data should be suitable for data reduction. Effectively, you need to have adequate correlations between the variables to be reduced to a smaller number of components.
- There should be no significant outliers.
4. Can PCA be used to reduce the dimensionality of a highly nonlinear dataset?
PCA can significantly reduce the dimensionality of most datasets, even highly nonlinear ones, because it can at least get rid of useless dimensions. However, if there are no useless dimensions, reducing dimensionality with PCA loses too much information.
5. What are the limitations of PCA?
1. It doesn’t work well for data that is not linearly correlated. When the data points lie on a nonlinear structure, a linear projection cannot capture it, so PCA performs poorly.
2. PCA always finds orthogonal principal components. Sometimes the directions of maximum variance in the data are not orthogonal, and standard PCA fails to find those directions.
3. PCA treats the low-variance components in the data as noise and recommends throwing them away. But sometimes those components play a major role in a supervised learning task.
4. If the variables are correlated, PCA can achieve dimension reduction. If not, PCA just orders them according to their variances.
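A quick illustration of this last point on synthetic, uncorrelated data (all values here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
# Uncorrelated features with different variances (illustrative values).
X = rng.normal(scale=[5.0, 2.0, 0.5], size=(5000, 3))

eigvals, eigvecs = np.linalg.eigh(np.cov(X - X.mean(axis=0), rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# With (nearly) uncorrelated features, each principal component is close to
# a single original axis: PCA merely reorders the features by variance.
print(np.round(np.abs(eigvecs), 2))
```

The eigenvector matrix comes out close to the identity, confirming that no genuine dimension reduction happened, only a reordering.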
6. Is rotation necessary in PCA? If yes, Why? What will happen if you don’t rotate the components?
Yes, (orthogonal) rotation is necessary to capture the maximum variance of the training set. If we don’t rotate the components, the effect of PCA diminishes, and we’ll have to select a larger number of components to explain the variance in the training set.
7. Is it important to standardize before applying PCA?
PCA finds new directions based on the covariance matrix of the original variables, and the covariance matrix is sensitive to the scales of those variables. We usually standardize to assign equal weight to all the variables; if we use features on different scales, we get misleading directions dominated by the large-scale features. However, standardization is not necessary if all the variables are already on the same scale.
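A small sketch of the scale effect, using hypothetical height/weight features on very different scales:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# Two correlated features on very different scales (hypothetical data):
# height in centimetres (std ~ 10) and weight in tonnes (std ~ 0.01).
height_cm = rng.normal(170.0, 10.0, n)
weight_t = (70.0 + 0.9 * (height_cm - 170.0) + rng.normal(0.0, 5.0, n)) / 1000.0
X = np.column_stack([height_cm, weight_t])

def first_pc(data):
    """Direction of maximum variance (leading eigenvector of the covariance)."""
    cov = np.cov(data - data.mean(axis=0), rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, np.argmax(eigvals)]

# Without standardization, the large-scale feature dominates the first PC.
pc_raw = first_pc(X)

# After standardization (zero mean, unit variance), both features contribute.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
pc_std = first_pc(X_std)

print(np.abs(pc_raw))   # ~ [1, 0]: direction dictated by scale alone
print(np.abs(pc_std))   # both features contribute comparably
```

The unstandardized first component essentially ignores the tonne-scale feature, which is exactly the misleading direction described above.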
8. Should one remove highly correlated variables before doing PCA?
No. PCA loads all highly correlated variables onto the same principal component (eigenvector), not onto different ones, so removing them beforehand is unnecessary.
9. What will happen when eigenvalues are roughly equal?
If the eigenvalues are roughly equal, the variance is spread almost evenly across all directions, so PCA cannot meaningfully select principal components: every component explains about the same amount of variance, and discarding any of them loses as much information as keeping it saves.
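This can be illustrated on synthetic isotropic data, where every direction has the same variance by construction:

```python
import numpy as np

rng = np.random.default_rng(3)
# Isotropic Gaussian: by construction, every direction has the same variance.
X = rng.normal(size=(20000, 3))

eigvals = np.linalg.eigvalsh(np.cov(X - X.mean(axis=0), rowvar=False))
explained = np.sort(eigvals)[::-1] / eigvals.sum()
print(np.round(explained, 3))   # each close to 1/3: no component stands out
```

With each component explaining about a third of the variance, there is no principled cutoff for dropping components.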
10. How can you evaluate the performance of a dimensionality reduction algorithm on your dataset?
Intuitively, a dimensionality reduction algorithm performs well if it eliminates a lot of dimensions from the dataset without losing too much information. Alternatively, if you are using dimensionality reduction as a preprocessing step before another Machine Learning algorithm (e.g., a Random Forest classifier), then you can simply measure the performance of that second algorithm; if dimensionality reduction did not lose too much information, then the algorithm should perform just as well as when using the original dataset.
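Both intuitions can be sketched with NumPy on synthetic data: an intrinsic score (fraction of variance retained) and the reconstruction error after projecting back (all names and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical dataset: 5 features whose variance mostly lives in 2 directions.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 5))

# PCA via eigendecomposition of the covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep k components, then project back to measure what was lost.
k = 2
W = eigvecs[:, :k]
X_rec = (Xc @ W) @ W.T + X.mean(axis=0)

variance_retained = eigvals[:k].sum() / eigvals.sum()   # intrinsic score
reconstruction_mse = np.mean((X - X_rec) ** 2)          # information lost
print(variance_retained, reconstruction_mse)
```

For the supervised route mentioned above, one would instead compare a downstream model's accuracy on `X` versus `X_reduced`; the intrinsic scores here need no second model.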