{"id":1941381,"date":"2023-01-30T12:00:35","date_gmt":"2023-01-30T17:00:35","guid":{"rendered":"https:\/\/wordpress-1016567-4521551.cloudwaysapps.com\/plato-data\/how-to-effectively-use-pandas-groupby\/"},"modified":"2023-01-30T12:00:35","modified_gmt":"2023-01-30T17:00:35","slug":"how-to-effectively-use-pandas-groupby","status":"publish","type":"station","link":"https:\/\/platodata.io\/plato-data\/how-to-effectively-use-pandas-groupby\/","title":{"rendered":"How to Effectively Use Pandas GroupBy"},"content":{"rendered":"
Pandas is a powerful, widely used open-source library for data manipulation and analysis in Python. One of its key features is the groupby function, which splits a DataFrame into groups based on one or more columns and then applies aggregation functions to each group.<\/p>\n
<\/p>\n
In this tutorial, you will learn how to use the groupby function in Pandas to group different types of data and perform various aggregation operations. By the end, you should be able to use this function to analyze and summarize data in many ways.<\/p>\n Concepts are best internalized through practice, so that is exactly what we will do next: get hands-on with the Pandas groupby function. A Jupyter Notebook<\/a> is recommended for this tutorial so that you can see the output at each step.<\/p>\n Import the following libraries:<\/p>\n <\/p>\n Next, we will initialize an empty dataframe and fill in values for each column as shown below:<\/p>\n <\/p>\n Bonus tip: a cleaner way to do the same task is to create a dictionary of all variables and values and then convert it to a dataframe.<\/p>\n <\/p>\n The dataframe looks like the one shown below. When you run this code, some of the values won\u2019t match because we are using a random sample.<\/p>\n <\/p>\n <\/p>\n Let\u2019s group the data by the \u201cMajor\u201d subject and retrieve one group to see how many records fall into it.<\/p>\n <\/p>\n So, four students belong to the Electrical Engineering major.<\/p>\n <\/p>\n You can also group by more than one column (Major and num_add_sbj in this case). <\/p>\n <\/p>\n Note that all the aggregate functions that can be applied to groups formed on a single column can also be applied to groups formed on multiple columns. For the rest of the tutorial, let\u2019s focus on the different types of aggregations using a single column as an example.<\/p>\n Let\u2019s create groups using groupby on the \u201cMajor\u201d column.<\/p>\n Let\u2019s say you want to find the average marks in each Major. What would you do? 
<\/p>\n <\/p>\n <\/p>\n Another way to achieve the same result is to use an aggregate function, as shown below:<\/p>\n <\/p>\n You can also apply multiple aggregations to the groups by passing the functions as a list of strings.<\/p>\n <\/p>\n But what if you need to apply a different function to each column? Don\u2019t worry: you can do that as well by passing a {column: function} mapping.<\/p>\n <\/p>\n <\/p>\n You may well need to apply a custom transformation to a particular column, which is easily achieved using groupby(). Let\u2019s define a standard scaler similar to the StandardScaler available in sklearn\u2019s preprocessing module. You can transform the numeric columns by calling the transform method and passing in the custom function.<\/p>\n <\/p>\n Note that \u201cNaN\u201d appears for groups whose standard deviation is zero or undefined (for example, single-member groups).<\/p>\n
<\/p>\nThe groupby<\/code> function is incredibly powerful, as it allows you to quickly summarize and analyze large datasets. For example, you can group a dataset by a specific column and calculate the mean, sum, or count of the remaining columns for each group. You can also group by multiple columns to get a more granular view of your data. Additionally, you can apply custom aggregation functions, which is invaluable for complex data analysis tasks.<\/p>\n
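For example (a minimal, hypothetical sketch; the column names and values below are made up and are not the tutorial's dataset), the split-apply-combine pattern looks like this:

```python
import pandas as pd

# Hypothetical toy data: sales records for two regions
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "amount": [100, 150, 200, 250],
})

# Split by "region", then apply an aggregation to each group
mean_by_region = sales.groupby("region")["amount"].mean()
total_by_region = sales.groupby("region")["amount"].sum()
print(mean_by_region)   # East 125.0, West 225.0
print(total_by_region)  # East 250, West 450
```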
Generate Sample Data<\/h2>\n
\n
import pandas as pd\nimport random\nimport pprint<\/code><\/pre>\n<\/div>\n
df = pd.DataFrame()\nnames = [\n    \"Sankepally\", \"Astitva\", \"Shagun\", \"SURAJ\", \"Amit\", \"RITAM\", \"Rishav\",\n    \"Chandan\", \"Diganta\", \"Abhishek\", \"Arpit\", \"Salman\", \"Anup\", \"Santosh\", \"Richard\",\n]\nmajor = [\n    \"Electrical Engineering\", \"Mechanical Engineering\", \"Electronic Engineering\",\n    \"Computer Engineering\", \"Artificial Intelligence\", \"Biotechnology\",\n]\nyear_adm = random.sample(list(range(2018, 2023)) * 100, 15)\nmarks = random.sample(range(40, 101), 15)\nnum_add_sbj = random.sample(list(range(2)) * 100, 15)\ndf[\"St_Name\"] = names\ndf[\"Major\"] = random.sample(major * 100, 15)\ndf[\"Year_adm\"] = year_adm\ndf[\"Marks\"] = marks\ndf[\"num_add_sbj\"] = num_add_sbj\ndf.head()\n<\/code><\/pre>\n<\/div>\n
student_dict = {\n    \"St_Name\": [\n        \"Sankepally\", \"Astitva\", \"Shagun\", \"SURAJ\", \"Amit\", \"RITAM\", \"Rishav\",\n        \"Chandan\", \"Diganta\", \"Abhishek\", \"Arpit\", \"Salman\", \"Anup\", \"Santosh\", \"Richard\",\n    ],\n    \"Major\": random.sample(\n        [\n            \"Electrical Engineering\", \"Mechanical Engineering\", \"Electronic Engineering\",\n            \"Computer Engineering\", \"Artificial Intelligence\", \"Biotechnology\",\n        ] * 100,\n        15,\n    ),\n    \"Year_adm\": random.sample(list(range(2018, 2023)) * 100, 15),\n    \"Marks\": random.sample(range(40, 101), 15),\n    \"num_add_sbj\": random.sample(list(range(2)) * 100, 15),\n}\ndf = pd.DataFrame(student_dict)\ndf.head()\n<\/code><\/pre>\n<\/div>\n
Making Groups<\/h2>\n
groups = df.groupby('Major')\ngroups.get_group('Electrical Engineering')<\/code><\/pre>\n<\/div>\n
<\/p>\ngroups = df.groupby(['Major', 'num_add_sbj'])<\/code><\/pre>\n<\/div>\n
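As a quick illustration (a minimal sketch with made-up values, not the random sample generated above), aggregations on multi-column groups work exactly as they do for a single key; the result is indexed by (Major, num_add_sbj) pairs:

```python
import pandas as pd

# Hypothetical mini-frame mirroring the tutorial's columns
df = pd.DataFrame({
    "Major": ["EE", "EE", "ME", "ME"],
    "num_add_sbj": [0, 1, 0, 0],
    "Marks": [70, 80, 60, 90],
})

# Grouping by two keys yields a MultiIndex of (Major, num_add_sbj) pairs
groups = df.groupby(["Major", "num_add_sbj"])
print(groups["Marks"].mean())
# (EE, 0) -> 70.0, (EE, 1) -> 80.0, (ME, 0) -> 75.0
```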
groups = df.groupby('Major')<\/code><\/pre>\n<\/div>\n
Applying Direct Functions<\/h2>\n
\n
groups['Marks'].mean().round(2)<\/code><\/pre>\n<\/div>\n
Major\nArtificial Intelligence 63.6\nComputer Engineering 45.5\nElectrical Engineering 71.0\nElectronic Engineering 92.0\nMechanical Engineering 64.5\nName: Marks, dtype: float64<\/code><\/pre>\n<\/div>\n
Aggregate<\/h2>\n
groups['Marks'].aggregate('mean').round(2)<\/code><\/pre>\n<\/div>\n
groups['Marks'].aggregate(['mean', 'median', 'std']).round(2)<\/code><\/pre>\n<\/div>\n
<\/p>\ngroups.aggregate({'Year_adm': 'median', 'Marks': 'mean'})<\/code><\/pre>\n<\/div>\n
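A closely related pattern is named aggregation, which applies a per-column function and labels the output column at the same time. The sketch below uses made-up values, not the tutorial's random sample:

```python
import pandas as pd

# Hypothetical mini-frame with the tutorial's column names
df = pd.DataFrame({
    "Major": ["EE", "EE", "ME"],
    "Marks": [60, 80, 70],
    "Year_adm": [2019, 2021, 2020],
})

# Each keyword names a result column as (source column, function)
summary = df.groupby("Major").agg(
    avg_marks=("Marks", "mean"),
    first_year=("Year_adm", "min"),
)
print(summary)
# EE: avg_marks 70.0, first_year 2019; ME: avg_marks 70.0, first_year 2020
```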
Transforms<\/h2>\n
def standard_scaler(x):\n    # Group-wise z-score: subtract the group mean, divide by the group std\n    return (x - x.mean()) \/ x.std()\n\ngroups[[\"Marks\", \"Year_adm\", \"num_add_sbj\"]].transform(standard_scaler)<\/code><\/pre>\n<\/div>\n
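To see where those NaN values come from, here is a self-contained sketch with made-up values (independent of the random sample): the single-member group scales to NaN because the sample standard deviation of one value is undefined.

```python
import pandas as pd

def standard_scaler(x):
    # Group-wise z-score: (value - group mean) / group std (ddof=1)
    return (x - x.mean()) / x.std()

df = pd.DataFrame({
    "Major": ["EE", "EE", "ME"],   # "ME" has only a single row
    "Marks": [60.0, 80.0, 70.0],
})
scaled = df.groupby("Major")[["Marks"]].transform(standard_scaler)
print(scaled)
# The EE rows become roughly -0.71 and 0.71; the lone ME row is NaN
```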
<\/p>\nFilter<\/h2>\n