Data Science : Making sense of High Dimensional Data using Visualization methods — Part I

Understanding the intuition and mathematics behind mapping high dimension spaces to low dimension spaces using Principal Component Analysis

Debayan Mitra
7 min read · May 28, 2022
#ref — https://experiments.withgoogle.com/visualizing-high-dimensional-space

Table of Contents —

1) An example of a High Dimension dataset

2) Understanding the Intuition behind visualizing the high dimension data into a low dimension space

3) Dimensionality Reduction Technique : PCA

  • PCA Intuition
  • PCA Mathematics and algorithm
  • PCA Application on the dataset
  • PCA final notes

1) An example of a High Dimension dataset

To keep things simple, we won't take a very high dimensional dataset, since the goal of this blog is to understand the workings. We will use the Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning Repository.

Link to dataset — https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)

This dataset contains 30 measurements collected from the medical examination of 569 patients, covering various aspects used to detect whether a patient's breast tumor is Malignant or Benign. The screenshot below shows sample medical data points collected for each patient, and the target column represents whether the detected tumor was malignant or benign. The dataset therefore contains 569 rows (one row per patient) and 31 columns (30 medical measurements + 1 target (benign/malignant)).

sample points

Let us take one of the rows from the above dataset. The 30 medical measurements and the tumor label for this patient are summarized below —

Now, each row can be represented as a 30-dimensional vector, as below —

where each component is the magnitude along the corresponding dimension; since there are 30 dimensions, we have a vector of length 30.
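In symbols, writing xᵢ for the value of the i-th medical measurement, the patient's row is the vector

```latex
\mathbf{x} = \left(x_1, x_2, \ldots, x_{30}\right)^{T} \in \mathbb{R}^{30}
```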

2) Understanding the Intuition behind visualizing the high dimension data into a low dimension space

Now, as a human being you might find this quite confusing: you have 30 medical measurements and you want to see whether there exist certain dimensions along which benign and malignant tumors can be distinguished correctly. So, you start plotting some visualizations as follows —

Plots — 1, 2, 3

Now, you start picking 2 dimensions at random (because we humans are most used to visualizing data in 2-dimensional spaces) from the 30 available medical dimensions for all 569 points, and you check whether certain pairs of dimensions help distinguish between benign and malignant tumors. As you can notice in Plot 1 (Mean Radius vs Worst Compactness), there is a fairly clear distinction between malignant and benign patients. The separation is much less clear in Plot 2 and somewhat better in Plot 3.
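As a rough sketch of how such a plot could be produced (not necessarily the exact code behind the plots above), using the scikit-learn copy of the dataset, where the two features are named 'mean radius' and 'worst compactness':

```python
# Sketch of Plot 1: two hand-picked features, colored by diagnosis.
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
df = data.frame  # 30 feature columns + 'target' (0 = malignant, 1 = benign)

for label, name in enumerate(data.target_names):  # ['malignant', 'benign']
    subset = df[df["target"] == label]
    plt.scatter(subset["mean radius"], subset["worst compactness"],
                label=name, alpha=0.6)

plt.xlabel("mean radius")
plt.ylabel("worst compactness")
plt.legend()
plt.show()
```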

But the problem is that you have 30 dimensions to pick from, and thereby a total of (30 Choose 2) combinations, i.e. 435 plots to analyze in order to visualize the data in a 2-dimensional space.

# ref — https://www.aplustopper.com/combination-formula-ncr/
number of combinations
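With n = 30 dimensions and r = 2 dimensions picked per plot, the count works out to

```latex
\binom{30}{2} = \frac{30!}{2!\,(30-2)!} = \frac{30 \times 29}{2} = 435
```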

Now, unless you are a superhuman, it is humanly impossible to visualize all 435 plots of the data and take a call on which medical measurements give a near-best representation of the dataset.

Therefore, we need a way to map the 30-dimensional space (HIGH DIMENSION) to a 2-dimensional space (LOW DIMENSION) while preserving as much information as possible from the 30 dimensions in the 2 dimensions, so that we can form a better understanding of the dataset.

The mathematical techniques used to convert such high-dimensional spaces to low-dimensional spaces while preserving as much information as possible during the compression are called Dimensionality Reduction techniques.

Please note that preserving information is important, or else we might end up with heavy information loss due to the compression of our dimensions.

3) Dimensionality Reduction Technique : PCA

PCA Intuition —

PCA stands for Principal Component Analysis. Intuitively, what it does is perform an orthogonal transformation of the axes (if you recollect class 11/12 mathematics) of the existing co-ordinate space such that the maximum information is preserved along the first transformed axis. That axis is called the first Principal Component. So, basically —

30 old dimensions → 30 ordered new dimensions

The new dimension space is ordered such that the 1st dimension in the new space carries the maximum information, followed by the 2nd dimension, then the 3rd, and so on. But you might ask: what is the point of doing this if all we get is the same number of dimensions in the new transformed space? The point is that you now have the flexibility to decide how much information you want to preserve. Let's say that after the transformation, the first 5 new dimensions preserve 90% of the information present in the dataset. You can then choose just the first 5 Principal Components, at the cost of a 10% information loss. Information in this case is quantified by the variance of the dataset along that particular dimension.
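As a small illustration of this "choose how much information to keep" idea, scikit-learn's PCA accepts a fractional n_components and keeps just enough components to reach that fraction of explained variance (a sketch on the raw, unscaled features used throughout this post):

```python
# Sketch: keep the smallest number of principal components that together
# preserve at least 90% of the variance in the dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

X, _ = load_breast_cancer(return_X_y=True)

pca = PCA(n_components=0.90)   # a float in (0, 1) means "fraction of variance to keep"
X_reduced = pca.fit_transform(X)

print(pca.n_components_)                      # number of components actually kept
print(pca.explained_variance_ratio_.sum())    # cumulative variance of those components
```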

We can also understand this visually by looking at the example below —

# ref — https://www.youtube.com/watch?v=SoRFuRVQjlo

As you can see from the above diagram, the original 'X' and 'Y' axes have been orthogonally transformed into the 'PCA1' and 'PCA2' space. We can see that the dataset has a lot of variance along PCA1, whereas it has very little variance along PCA2 compared to PCA1. So, we just keep PCA1 and skip PCA2, since PCA2 contributes little information while PCA1 contributes a lot, thereby obtaining a lower-dimensional representation of the original dataset.
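A quick way to convince yourself numerically: generate correlated 2-D data, rotate it onto the eigenvectors of its covariance matrix, and compare the variances along the two new axes (an illustrative sketch, not tied to the dataset above):

```python
# Sketch: variance of correlated 2-D data after rotating it onto the
# eigenvectors of its covariance matrix (i.e. onto PCA1 and PCA2).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.9 * x + 0.2 * rng.normal(size=1000)   # y is strongly correlated with x
data = np.column_stack([x, y])

cov = np.cov(data, rowvar=False)            # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]           # reorder so PCA1 (largest variance) comes first
eigvecs = eigvecs[:, order]

rotated = data @ eigvecs                    # project the points onto PCA1 and PCA2
print(rotated.var(axis=0))                  # variance along PCA1 >> variance along PCA2
```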

PCA Mathematics and Algorithm —

Now, we can deep dive into how these principal components are obtained. Our objective is basically to perform a rotation of the axes in the given D-dimensional space such that the variance along the first dimension of the new co-ordinate space is maximized. So, mathematically, our objective becomes —
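In other words, we look for the unit direction u1 that maximizes the variance of the data projected onto it:

```latex
\max_{u_1 \,:\, u_1^{T}u_1 = 1} \; \operatorname{Var}\!\left(u_1^{T}x\right)
```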

Now, we calculate the variance of the projected data points along the first Principal Component. With a bit of linear algebra and matrix operations, we can see that this objective is nothing but the quadratic form u1ᵀSu1, where S is the covariance matrix of the original dataset and u1 is the unit vector in the direction of the first Principal Component.
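Concretely, writing x̄ for the mean of the N data points:

```latex
\frac{1}{N}\sum_{n=1}^{N}\left(u_1^{T}x_n - u_1^{T}\bar{x}\right)^{2}
= u_1^{T}\underbrace{\left[\frac{1}{N}\sum_{n=1}^{N}\left(x_n-\bar{x}\right)\left(x_n-\bar{x}\right)^{T}\right]}_{S}\, u_1
= u_1^{T} S u_1
```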

Now, framing this as an optimization problem, the objective is to maximize the variance along the first Principal Component. We can impose a constraint, since the basis vectors of the new dimension space are all unit vectors. This becomes a constrained optimization problem that can be solved using Lagrange multipliers. If you want a good understanding of how Lagrange multipliers solve an optimization problem, you can take a look at this video, which explains it really well —

# ref — https://www.youtube.com/watch?v=8mjcnxGMwFo

So, our objective function mathematically becomes —
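In the notation above,

```latex
\max_{u_1}\; u_1^{T} S u_1 \quad \text{subject to} \quad u_1^{T} u_1 = 1
```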

We can see that the solution for the new principal component vector is nothing but an eigenvector of the covariance matrix of the original dataset, and lambda is the corresponding eigenvalue.
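To see this, form the Lagrangian of the constrained problem and set its gradient with respect to u1 to zero:

```latex
L(u_1, \lambda) = u_1^{T} S u_1 + \lambda\left(1 - u_1^{T} u_1\right),
\qquad
\frac{\partial L}{\partial u_1} = 2 S u_1 - 2 \lambda u_1 = 0
\;\Longrightarrow\;
S u_1 = \lambda u_1
```

Left-multiplying Su1 = λu1 by u1ᵀ and using u1ᵀu1 = 1 gives u1ᵀSu1 = λ, i.e. the variance along u1 equals the eigenvalue, so maximizing the variance means picking the eigenvector with the largest eigenvalue.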

Here is a simple implementation of PCA in Python —

PCA — code snippet
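A minimal version consistent with the eigenvector derivation above (a sketch, not necessarily the exact snippet shown) could look like this:

```python
# Minimal from-scratch PCA: eigen-decompose the covariance matrix and
# project the (centered) data onto the top components.
import numpy as np
from sklearn.datasets import load_breast_cancer

def pca(X, n_components):
    X_centered = X - X.mean(axis=0)                  # center each feature
    S = np.cov(X_centered, rowvar=False)             # covariance matrix of the data
    eigvals, eigvecs = np.linalg.eigh(S)             # S is symmetric, so eigh applies
    order = np.argsort(eigvals)[::-1]                # sort by decreasing variance
    components = eigvecs[:, order[:n_components]]    # first n_components eigenvectors
    explained_ratio = eigvals[order] / eigvals.sum()
    return X_centered @ components, explained_ratio[:n_components]

X, y = load_breast_cancer(return_X_y=True)
X_2d, ratio = pca(X, n_components=2)
print(ratio)   # fraction of variance captured by PC 1 and PC 2
```

In practice, sklearn.decomposition.PCA performs the equivalent computation (via an SVD), and that is what the sketches below use.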

PCA Application on the dataset

On applying PCA to the above dataset, we see the following results —

The first Principal Component explains 98.25% of the variance, and the first two Principal Components together explain ~99.75% of the variance in the dataset. Thereby, we can easily transform the dataset from 30-D to 2-D while losing only ~0.25% of the information in the dataset.

Here is how we calculate the cumulative variance explained. The PCs have an ordered form (PC 1 > PC 2 > PC 3 > … > PC 30) because the corresponding eigenvalues (EV) are ordered (EV 1 > EV 2 > … > EV 30). Thereby, the cumulative sum ratio gives the % of variance explained if we choose to retain only the first 'x' Principal Components as the dimensionality of the dataset.

# ref — https://medium.com/analytics-vidhya/principal-component-analysis-pca-112157c2d691
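In code, this is just a running sum over the per-component explained-variance ratios (a sketch; the exact numbers can differ slightly depending on preprocessing):

```python
# Sketch: cumulative explained-variance ratio across all 30 principal components.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

X, _ = load_breast_cancer(return_X_y=True)
pca = PCA(n_components=30).fit(X)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative[:2])   # variance explained by PC 1 alone, and by PC 1 + PC 2
```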

We can visualize the same by transforming the original dataset into the new PC space, i.e. by projecting the original data points onto the new principal components.

As we can see, the graph below shows a fair amount of separation between the 2 tumor types in a 2-D space (the transformed PC space), while accounting for roughly 99.75% of the information present in the original dataset.

The graph above seems to be a better visualization in the 2-D space than randomly selecting 2 features from the available 30-feature space.
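A sketch of how such a 2-D projection and plot could be produced:

```python
# Sketch: project the data onto the first two principal components and
# plot the two tumor classes in the transformed space.
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

data = load_breast_cancer()
X_2d = PCA(n_components=2).fit_transform(data.data)

for label, name in enumerate(data.target_names):   # ['malignant', 'benign']
    plt.scatter(X_2d[data.target == label, 0],
                X_2d[data.target == label, 1],
                label=name, alpha=0.6)

plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.legend()
plt.show()
```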

PCA Final notes

PCA is not a foolproof method. It has many limitations, because PCA only tries to preserve the global structure of the data and not the local structure. So, any non-linear perturbations in local neighbourhoods of the dataset cannot be captured by the Principal Components.
