{"id":2655559,"date":"2023-05-16T10:00:27","date_gmt":"2023-05-16T14:00:27","guid":{"rendered":"https:\/\/wordpress-1016567-4521551.cloudwaysapps.com\/plato-data\/principal-component-analysis-pca-with-scikit-learn-kdnuggets\/"},"modified":"2023-05-16T10:00:27","modified_gmt":"2023-05-16T14:00:27","slug":"principal-component-analysis-pca-with-scikit-learn-kdnuggets","status":"publish","type":"station","link":"https:\/\/platodata.io\/plato-data\/principal-component-analysis-pca-with-scikit-learn-kdnuggets\/","title":{"rendered":"Principal Component Analysis (PCA) with Scikit-Learn &#8211; KDnuggets"},"content":{"rendered":"<p><img decoding=\"async\" src=\"https:\/\/wordpress-1016567-4521551.cloudwaysapps.com\/wp-content\/uploads\/2023\/05\/principal-component-analysis-pca-with-scikit-learn-kdnuggets.png\" alt=\"Principal Component Analysis (PCA) with Scikit-Learn\" width=\"100%\"><br \/><span>Image by Author<br \/><\/span><br \/>&nbsp; <\/p>\n<p>If you\u2019re familiar with the unsupervised learning paradigm, you\u2019ve likely come across dimensionality reduction and the algorithms used for it, such as <strong>principal component analysis<\/strong> (PCA). Datasets for machine learning often contain a large number of features, but such high-dimensional feature spaces are not always helpful.<\/p>\n<p>In general, <i>not<\/i> all features are equally important, and a few features often account for a large percentage of the variance in the dataset. Dimensionality reduction algorithms aim to reduce the dimension of the feature space to a fraction of the original number of dimensions. In doing so, most of the variance is still retained, but in a transformed feature space. 
Principal component analysis (PCA) is one of the most popular dimensionality reduction algorithms.<\/p>\n<p>In this tutorial, we\u2019ll learn how PCA works and how to implement it using the scikit-learn library.<\/p>\n<p>Before we implement PCA in scikit-learn, it\u2019s helpful to understand how it works.<\/p>\n<p>As mentioned, PCA is a dimensionality reduction algorithm, meaning it reduces the dimensionality of the feature space. But how does it achieve this reduction?<\/p>\n<p>The motivation behind the algorithm is that a few directions in the feature space capture a large percentage of the variance in the original dataset. So it&#8217;s important to find the <strong>directions of maximum variance<\/strong> in the dataset. These directions are called <strong>principal components<\/strong>. PCA is essentially a projection of the dataset onto the principal components.<\/p>\n<p>So how do we find the principal components?&nbsp;<\/p>\n<p>Suppose the data matrix X has dimensions <strong>num_observations x num_features<\/strong>. We perform <a href=\"https:\/\/en.wikipedia.org\/wiki\/Eigendecomposition_of_a_matrix\" rel=\"noopener\" target=\"_blank\">eigenvalue decomposition<\/a> on the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Covariance_matrix\" rel=\"noopener\" target=\"_blank\">covariance matrix<\/a> of X.<\/p>\n<p>If the features are all zero mean, then the covariance matrix is given by X.T X, up to a constant scaling factor of 1\/(num_observations - 1) that does not change the eigenvectors. Here, X.T is the transpose of the matrix X. If the features are not all zero mean initially, we can subtract the mean of each column from every entry in that column and then compute the covariance matrix. 
It\u2019s simple to see that the covariance matrix is a square matrix of order <strong>num_features<\/strong>.<\/p>\n<p>&nbsp;<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/wordpress-1016567-4521551.cloudwaysapps.com\/wp-content\/uploads\/2023\/05\/principal-component-analysis-pca-with-scikit-learn-kdnuggets-1.png\" alt=\"Principal Component Analysis (PCA) with Scikit-Learn\" width=\"100%\"><br \/><span>Image by Author<\/span><br \/>&nbsp; <\/p>\n<p>The first k principal components are the <i>eigenvectors<\/i> corresponding to the <i>k largest eigenvalues<\/i>.&nbsp;<\/p>\n<p>So the steps in PCA can be summarized as follows:<br \/>&nbsp;<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/wordpress-1016567-4521551.cloudwaysapps.com\/wp-content\/uploads\/2023\/05\/principal-component-analysis-pca-with-scikit-learn-kdnuggets-2.png\" alt=\"Principal Component Analysis (PCA) with Scikit-Learn\" width=\"100%\"><br \/><span>Image by Author<\/span><br \/>&nbsp; <\/p>\n<p>Because the covariance matrix is symmetric and positive semi-definite, the eigendecomposition takes the following form:<\/p>\n<p>X.T X = D \u039b D.T<\/p>\n<p>Here, D is the matrix of eigenvectors and \u039b is a diagonal matrix of eigenvalues.<\/p>\n<p>Another matrix factorization technique that can be used to compute principal components is singular value decomposition (SVD).&nbsp;<\/p>\n<p>SVD is defined for all matrices. Given a matrix X, the SVD of X gives: X = U \u03a3 V.T. Here, U, \u03a3, and V are the matrices of left singular vectors, singular values, and right singular vectors, respectively. V.T 
is the transpose of V.&nbsp;<\/p>\n<p>So the SVD of the covariance matrix of X is given by:<br \/>&nbsp;<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/wordpress-1016567-4521551.cloudwaysapps.com\/wp-content\/uploads\/2023\/05\/principal-component-analysis-pca-with-scikit-learn-kdnuggets-3.png\" alt=\"Principal Component Analysis (PCA) with Scikit-Learn\" width=\"50%\"><br \/>&nbsp;<br \/>Comparing the equivalence of the two matrix decompositions:<br \/>&nbsp;<br \/><img decoding=\"async\" src=\"https:\/\/wordpress-1016567-4521551.cloudwaysapps.com\/wp-content\/uploads\/2023\/05\/principal-component-analysis-pca-with-scikit-learn-kdnuggets-4.png\" alt=\"Principal Component Analysis (PCA) with Scikit-Learn\" width=\"50%\"><br \/>&nbsp; <\/p>\n<p>We have the following:&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/wordpress-1016567-4521551.cloudwaysapps.com\/wp-content\/uploads\/2023\/05\/principal-component-analysis-pca-with-scikit-learn-kdnuggets-5.png\" alt=\"Principal Component Analysis (PCA) with Scikit-Learn\" width=\"70%\"><br \/>&nbsp; <\/p>\n<p>There are computationally efficient algorithms for calculating the SVD of a matrix. The scikit-learn implementation of PCA also uses SVD under the hood to compute the principal components.<\/p>\n<p>Now that we\u2019ve learned the basics of principal component analysis, let\u2019s proceed with the scikit-learn implementation of the same.<\/p>\n<h2>Step 1 \u2013 Load the Dataset<\/h2>\n<p>To understand how to implement principal component analysis, let\u2019s use a simple dataset. 
In this tutorial, we\u2019ll use the wine dataset available as part of scikit-learn&#8217;s <strong>datasets<\/strong> module.<\/p>\n<p>Let\u2019s start by loading the dataset:<\/p>\n<div>\n<pre><code>from sklearn import datasets\n# Load the wine dataset with the features as a pandas DataFrame\nwine_data = datasets.load_wine(as_frame=True)\ndf = wine_data.data<\/code><\/pre>\n<\/div>\n<p>&nbsp; <\/p>\n<p>It has 13 features and 178 records in all.<\/p>\n<div>\n<pre><code>print(df.shape)\nOutput &gt;&gt; (178, 13)<\/code><\/pre>\n<\/div>\n<p>&nbsp; <\/p>\n<div>\n<pre><code>print(df.info())\nOutput &gt;&gt;\n&lt;class 'pandas.core.frame.DataFrame'&gt;\nRangeIndex: 178 entries, 0 to 177\nData columns (total 13 columns):\n #   Column                        Non-Null Count  Dtype\n---  ------                        --------------  -----\n 0   alcohol                       178 non-null    float64\n 1   malic_acid                    178 non-null    float64\n 2   ash                           178 non-null    float64\n 3   alcalinity_of_ash             178 non-null    float64\n 4   magnesium                     178 non-null    float64\n 5   total_phenols                 178 non-null    float64\n 6   flavanoids                    178 non-null    float64\n 7   nonflavanoid_phenols          178 non-null    float64\n 8   proanthocyanins               178 non-null    float64\n 9   color_intensity               178 non-null    float64\n 10  hue                           178 non-null    float64\n 11  od280\/od315_of_diluted_wines  178 non-null    float64\n 12  proline                       178 non-null    float64\ndtypes: float64(13)\nmemory usage: 18.2 KB\nNone<\/code><\/pre>\n<\/div>\n<h2>Step 2 \u2013 Preprocess the Dataset<\/h2>\n<p>As a next step, let&#8217;s preprocess the dataset. The features are all on different scales. 
To bring them all to a common scale, we\u2019ll use the <code>StandardScaler<\/code> that transforms the features to have zero mean and unit variance:<\/p>\n<div>\n<pre><code>from sklearn.preprocessing import StandardScaler\nstd_scaler = StandardScaler()\nscaled_df = std_scaler.fit_transform(df)<\/code><\/pre>\n<\/div>\n<h2>Step 3 \u2013 Perform PCA on the Preprocessed Dataset<\/h2>\n<p>To find the principal components, we can use the PCA class from scikit-learn\u2019s <strong>decomposition<\/strong> module.<\/p>\n<p>Let\u2019s instantiate a PCA object by passing in the number of principal components <code>n_components<\/code> to the constructor.&nbsp;<\/p>\n<p>The number of principal components is the number of dimensions that you\u2019d like to reduce the feature space to. Here, we set the number of components to 3.<\/p>\n<div>\n<pre><code>from sklearn.decomposition import PCA\npca = PCA(n_components=3)\npca.fit_transform(scaled_df)<\/code><\/pre>\n<\/div>\n<p>&nbsp; <\/p>\n<p>Instead of calling the <code>fit_transform()<\/code> method, you can also call <code>fit()<\/code> followed by the <code>transform()<\/code> method.<\/p>\n<p>Notice how the steps in principal component analysis such as computing the covariance matrix, performing eigendecomposition or singular value decomposition on the covariance matrix to get the principal components have all been abstracted away when we use scikit-learn\u2019s implementation of PCA.<\/p>\n<h2>Step 4 \u2013 Examining Some Useful Attributes of the PCA Object<\/h2>\n<p>The PCA instance <code>pca<\/code> that we created has several useful attributes that help us understand what is going on under the hood.<\/p>\n<p>The attribute <code>components_<\/code> stores the directions of maximum variance (the principal components).<\/p>\n<div>\n<pre><code>print(pca.components_)<\/code><\/pre>\n<\/div>\n<p>&nbsp; <\/p>\n<div>\n<pre><code>Output &gt;&gt;\n[[ 0.1443294 -0.24518758 -0.00205106 -0.23932041 0.14199204 0.39466085 0.4229343 
-0.2985331 0.31342949 -0.0886167 0.29671456 0.37616741 0.28675223]\n [-0.48365155 -0.22493093 -0.31606881 0.0105905 -0.299634 -0.06503951 0.00335981 -0.02877949 -0.03930172 -0.52999567 0.27923515 0.16449619 -0.36490283]\n [-0.20738262 0.08901289 0.6262239 0.61208035 0.13075693 0.14617896 0.1506819 0.17036816 0.14945431 -0.13730621 0.08522192 0.16600459 -0.12674592]]<\/code><\/pre>\n<\/div>\n<p>&nbsp; <\/p>\n<p>We mentioned that the principal components are directions of maximum variance in the dataset. But how do we measure <i>how much of the total variance<\/i> is captured by the principal components we just chose?<\/p>\n<p>The <code>explained_variance_ratio_<\/code> attribute stores the fraction of the total variance that each principal component captures. So we can sum these ratios to get the fraction of the total variance captured by the chosen number of components.<\/p>\n<div>\n<pre><code>print(sum(pca.explained_variance_ratio_))<\/code><\/pre>\n<\/div>\n<p>&nbsp; <\/p>\n<div>\n<pre><code>Output &gt;&gt; 0.6652996889318527<\/code><\/pre>\n<\/div>\n<p>&nbsp; <\/p>\n<p>Here, we see that three principal components capture over 66.5% of the total variance in the dataset.<\/p>\n<h2>Step 5 \u2013 Analyzing the Change in Explained Variance Ratio<\/h2>\n<p>We can try running principal component analysis by varying the number of components <code>n_components<\/code>.<\/p>\n<div>\n<pre><code>import numpy as np\nnums = np.arange(14)<\/code><\/pre>\n<\/div>\n<p>&nbsp; <\/p>\n<div>\n<pre><code>var_ratio = []\n# Fit PCA for each value of n_components and record the total explained variance ratio\nfor num in nums:\n    pca = PCA(n_components=num)\n    pca.fit(scaled_df)\n    var_ratio.append(np.sum(pca.explained_variance_ratio_))<\/code><\/pre>\n<\/div>\n<p>&nbsp; <\/p>\n<p>To visualize how the total explained variance ratio changes with the number of components, let\u2019s plot the two quantities as shown:<\/p>\n<div>\n<pre><code>import matplotlib.pyplot as plt\nplt.figure(figsize=(4,2),dpi=150)\nplt.grid()\nplt.plot(nums,var_ratio,marker='o')\nplt.xlabel('n_components')\nplt.ylabel('Explained variance 
ratio')\nplt.title('n_components vs. Explained Variance Ratio')\nplt.show()<\/code><\/pre>\n<\/div>\n<p>&nbsp; <\/p>\n<p>When we use all 13 components, the sum of the <code>explained_variance_ratio_<\/code> values is 1.0, indicating that we\u2019ve captured 100% of the variance in the dataset.&nbsp;<\/p>\n<p>In this example, we see that with 6 principal components, we&#8217;ll be able to capture more than 80% of the variance in the input dataset.<br \/>&nbsp;<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/wordpress-1016567-4521551.cloudwaysapps.com\/wp-content\/uploads\/2023\/05\/principal-component-analysis-pca-with-scikit-learn-kdnuggets-6.png\" alt=\"Principal Component Analysis (PCA) with Scikit-Learn\" width=\"70%\"> <\/p>\n<p>I hope you\u2019ve learned how to perform principal component analysis using built-in functionality in the scikit-learn library. Next, you can try to implement PCA on a dataset of your choice. If you\u2019re looking for good datasets to work with, check out this list of <a href=\"https:\/\/www.kdnuggets.com\/2023\/04\/10-websites-get-amazing-data-data-science-projects.html\" rel=\"noopener\" target=\"_blank\">websites to find datasets for your data science projects<\/a>.<\/p>\n<p>[1] <a href=\"https:\/\/github.com\/fastai\/numerical-linear-algebra\/blob\/master\/README.md\" rel=\"noopener\" target=\"_blank\">Computational Linear Algebra<\/a>, fast.ai<br \/>&nbsp;<br \/>&nbsp;<br \/><b><a href=\"https:\/\/www.linkedin.com\/in\/bala-priya\/\" target=\"_blank\" rel=\"noopener\">Bala Priya C<\/a><\/b> is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! 
Currently, she&#8217;s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more.<\/p>\n<ul class=\"plato-post-bottom-links\">\n<li class=\"plato-post-bottom-link-source\"><span>Source:<\/span> <a href=\"https:\/\/www.kdnuggets.com\/2023\/05\/principal-component-analysis-pca-scikitlearn.html?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=principal-component-analysis-pca-with-scikit-learn\" target=\"_blank\" rel=\"noopener\">https:\/\/www.kdnuggets.com\/2023\/05\/principal-component-analysis-pca-scikitlearn.html?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=principal-component-analysis-pca-with-scikit-learn<\/a><\/li>\n<\/ul>\n","protected":false,"author":1,"featured_media":2655560,"template":"Default","meta":{"_eb_attr":"","type":"","auto_type":false,"post":"","stream":"","stream_url":"","waveform_data":[],"duration":0,"start":0,"end":0,"bpm":0,"downloadable":false,"download_url":"","purchase_title":"","purchase_url":"","post-count-all":0,"like_count":0,"download_count":0,"editor_note":"","copyright":"","captions":[],"sources":[]},"genre":[42022],"station_tag":[21246,38012,33649,34957,21579,10944,3759,69356,69363,69364,68872,3629,48576,4262,13775,31352,4263,12626,9087,12627,69602,19769,4339,4044,3761,68956,12467,4045,40507,18340,48551,34374,48552,48553,11007,42276,9837,69526,5243,48554,68863,14207,13083,10228,69017,48555,13833,3886,24510,39889,48563,11501,11838,12200,4147,39832,12939,47134,12353,11499,4963,5644,17500,11364,4442,3690,5619,13807,5216,5256,6978,16593,12195,11290,11002,11927,3642,8865,40775,47689,5167,43888,9163,6486,13827,49446,69099,69339,39854,14017,10727,39856,40183,11930,46346,9166,9417,24487,12916,6997,3729,4382,30994,3772,39959,9874,12646,48556,4177,41709,13776,9839,4072,40035,40168,40059,19651,9167,4491,5428,48557,9238,10060,69342,3732,3801,21664,14049,3694,4185,18595,3650,48567,69344,9267,10903,22734,9365,19265,10371,4693,48084,39837,11381,13669,3652,11694,3653,69019,43209,4597,4572,3806,11367,4318,4573
,4010,69928,5388,48580,4620,3911,69388,4965,12523,9169,10223,4089,5763,15911,8465,40547,69345,10001,3740,8690,11767,40145,41087,10997,9483,4673,33711,69523,10517,43581,16265,14094,10785,18469,33849,48559,48560,17454,4731,68946,3662,14287,69348,48879,68865,4712,12344,12074,9171,40330,12019,13672,7457,8453,9642,4017,10871,7934,11810,5195,9424,22345,4502,4109,4110,40122,70050,47471,70535,4674,48565,13653,4117,41689,4118,15397,69188,3976,12941,12461,12254,69362,12345,4461,41618,3779,11458,3977,9434,3873,3714,21389,31196,13751,4261,39843,24465,4126,68954,48561,13215,5247,48569,69447,28675,48566,68948,13569,19267,25082,13285,70380,36021,6313,69591,10873,13074,9872,21542,43592,3984,37458,10367,68866,69023,3782,24736,48579,68959,9178,4368,69189,4518,28790,48562,3935,9608,4240,5152,467,4927,68949,13217,14555,4668],"artist":[42028],"mood":[],"activity":[],"_links":{"self":[{"href":"https:\/\/platodata.io\/wp-json\/wp\/v2\/station\/2655559"}],"collection":[{"href":"https:\/\/platodata.io\/wp-json\/wp\/v2\/station"}],"about":[{"href":"https:\/\/platodata.io\/wp-json\/wp\/v2\/types\/station"}],"author":[{"embeddable":true,"href":"https:\/\/platodata.io\/wp-json\/wp\/v2\/users\/1"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/platodata.io\/wp-json\/wp\/v2\/media\/2655560"}],"wp:attachment":[{"href":"https:\/\/platodata.io\/wp-json\/wp\/v2\/media?parent=2655559"}],"wp:term":[{"taxonomy":"genre","embeddable":true,"href":"https:\/\/platodata.io\/wp-json\/wp\/v2\/genre?post=2655559"},{"taxonomy":"station_tag","embeddable":true,"href":"https:\/\/platodata.io\/wp-json\/wp\/v2\/station_tag?post=2655559"},{"taxonomy":"artist","embeddable":true,"href":"https:\/\/platodata.io\/wp-json\/wp\/v2\/artist?post=2655559"},{"taxonomy":"mood","embeddable":true,"href":"https:\/\/platodata.io\/wp-json\/wp\/v2\/mood?post=2655559"},{"taxonomy":"activity","embeddable":true,"href":"https:\/\/platodata.io\/wp-json\/wp\/v2\/activity?post=2655559"}],"curies":[{"name":"wp","href":"https:\/\/api.
w.org\/{rel}","templated":true}]}}