{"id":1941381,"date":"2023-01-30T12:00:35","date_gmt":"2023-01-30T17:00:35","guid":{"rendered":"https:\/\/wordpress-1016567-4521551.cloudwaysapps.com\/plato-data\/how-to-effectively-use-pandas-groupby\/"},"modified":"2023-01-30T12:00:35","modified_gmt":"2023-01-30T17:00:35","slug":"how-to-effectively-use-pandas-groupby","status":"publish","type":"station","link":"https:\/\/platodata.io\/plato-data\/how-to-effectively-use-pandas-groupby\/","title":{"rendered":"How to Effectively Use Pandas GroupBy"},"content":{"rendered":"
Pandas is a powerful, widely used open-source library for data manipulation and analysis in Python. One of its key features is the groupby function, which splits a DataFrame into groups based on one or more columns and then applies aggregation functions to each group.<\/p>\n
<\/p>\n
In this tutorial, you will learn how to use the groupby function in Pandas to group different types of data and perform various aggregation operations. By the end, you should be able to use this function to analyze and summarize data in many ways.<\/p>\n Concepts are best internalized through practice, so that is exactly what we will do next: get hands-on with the Pandas groupby function. A Jupyter Notebook<\/a> is recommended for this tutorial so that you can see the output at each step.<\/p>\n Import the following libraries:<\/p>\n <\/p>\n Next, we will initialize an empty dataframe and fill in values for each column as shown below:<\/p>\n <\/p>\n Bonus tip: a cleaner way to do the same task is to create a dictionary of all variables and values and then convert it to a dataframe.<\/p>\n <\/p>\n The dataframe looks like the one shown below. When you run this code, some of the values won\u2019t match because we are using a random sample.<\/p>\n <\/p>\n <\/p>\n Let\u2019s group the data by the \u201cMajor\u201d subject and retrieve one group to see how many records fall into it.<\/p>\n <\/p>\n So, four students belong to the Electrical Engineering major.<\/p>\n <\/p>\n You can also group by more than one column (Major and num_add_sbj in this case). <\/p>\n <\/p>\n Note that all the aggregate functions that can be applied to groups formed on a single column can also be applied to groups formed on multiple columns. For the rest of the tutorial, let\u2019s focus on the different types of aggregations using a single column as an example.<\/p>\n Let\u2019s create groups using groupby on the \u201cMajor\u201d column.<\/p>\n Let\u2019s say you want to find the average marks in each Major. What would you do? 
<\/p>\n <\/p>\n <\/p>\n Another way to achieve the same result is to use an aggregate function, as shown below:<\/p>\n <\/p>\n You can also apply multiple aggregations to the groups by passing the functions as a list of strings.<\/p>\n <\/p>\n But what if you need to apply a different function to each column? Don\u2019t worry: you can do that as well by passing a {column: function} mapping.<\/p>\n <\/p>\n <\/p>\n You may well need to apply a custom transformation to a particular column, which is easily achieved using groupby(). Let\u2019s define a standard scaler similar to the StandardScaler available in sklearn\u2019s preprocessing module. You can transform the numeric columns by calling the transform method and passing in the custom function.<\/p>\n <\/p>\n Note that \u201cNaN\u201d appears for groups whose standard deviation is zero or undefined (for example, single-member groups).<\/p>\n
<\/p>\nThe groupby<\/code> function is incredibly powerful, as it allows you to quickly summarize and analyze large datasets. For example, you can group a dataset by a specific column and calculate the mean, sum, or count of the remaining columns for each group. You can also group by multiple columns to get a more granular view of your data. Additionally, you can apply custom aggregation functions, which is invaluable for complex data analysis tasks.<\/p>\n
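For example (a minimal, hypothetical sketch; the column names and values below are made up and are not the tutorial's dataset), the split-apply-combine pattern looks like this:

```python
import pandas as pd

# Hypothetical toy data: sales records for two regions
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "amount": [100, 150, 200, 250],
})

# Split by "region", then apply an aggregation to each group
mean_by_region = sales.groupby("region")["amount"].mean()
total_by_region = sales.groupby("region")["amount"].sum()
print(mean_by_region)   # East 125.0, West 225.0
print(total_by_region)  # East 250, West 450
```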
Generate Sample Data<\/h2>\n
\n
import pandas as pd\nimport random\nimport pprint<\/code><\/pre>\n<\/div>\n
df = pd.DataFrame()\nnames = [\n    \"Sankepally\", \"Astitva\", \"Shagun\", \"SURAJ\", \"Amit\", \"RITAM\", \"Rishav\",\n    \"Chandan\", \"Diganta\", \"Abhishek\", \"Arpit\", \"Salman\", \"Anup\", \"Santosh\", \"Richard\",\n]\nmajor = [\n    \"Electrical Engineering\", \"Mechanical Engineering\", \"Electronic Engineering\",\n    \"Computer Engineering\", \"Artificial Intelligence\", \"Biotechnology\",\n]\nyear_adm = random.sample(list(range(2018, 2023)) * 100, 15)\nmarks = random.sample(range(40, 101), 15)\nnum_add_sbj = random.sample(list(range(2)) * 100, 15)\ndf[\"St_Name\"] = names\ndf[\"Major\"] = random.sample(major * 100, 15)\ndf[\"Year_adm\"] = year_adm\ndf[\"Marks\"] = marks\ndf[\"num_add_sbj\"] = num_add_sbj\ndf.head()\n<\/code><\/pre>\n<\/div>\n
student_dict = {\n    \"St_Name\": [\n        \"Sankepally\", \"Astitva\", \"Shagun\", \"SURAJ\", \"Amit\", \"RITAM\", \"Rishav\",\n        \"Chandan\", \"Diganta\", \"Abhishek\", \"Arpit\", \"Salman\", \"Anup\", \"Santosh\", \"Richard\",\n    ],\n    \"Major\": random.sample(\n        [\n            \"Electrical Engineering\", \"Mechanical Engineering\", \"Electronic Engineering\",\n            \"Computer Engineering\", \"Artificial Intelligence\", \"Biotechnology\",\n        ] * 100,\n        15,\n    ),\n    \"Year_adm\": random.sample(list(range(2018, 2023)) * 100, 15),\n    \"Marks\": random.sample(range(40, 101), 15),\n    \"num_add_sbj\": random.sample(list(range(2)) * 100, 15),\n}\ndf = pd.DataFrame(student_dict)\ndf.head()\n<\/code><\/pre>\n<\/div>\n
Making Groups<\/h2>\n
groups = df.groupby('Major')\ngroups.get_group('Electrical Engineering')<\/code><\/pre>\n<\/div>\n
<\/p>\ngroups = df.groupby(['Major', 'num_add_sbj'])<\/code><\/pre>\n<\/div>\n
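As a quick illustration (a minimal sketch with made-up values, not the random sample generated above), aggregations on multi-column groups work exactly as they do for a single key; the result is indexed by (Major, num_add_sbj) pairs:

```python
import pandas as pd

# Hypothetical mini-frame mirroring the tutorial's columns
df = pd.DataFrame({
    "Major": ["EE", "EE", "ME", "ME"],
    "num_add_sbj": [0, 1, 0, 0],
    "Marks": [70, 80, 60, 90],
})

# Grouping by two keys yields a MultiIndex of (Major, num_add_sbj) pairs
groups = df.groupby(["Major", "num_add_sbj"])
print(groups["Marks"].mean())
# (EE, 0) -> 70.0, (EE, 1) -> 80.0, (ME, 0) -> 75.0
```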
groups = df.groupby('Major')<\/code><\/pre>\n<\/div>\n
Applying Direct Functions<\/h2>\n
\n
groups['Marks'].mean().round(2)<\/code><\/pre>\n<\/div>\n
Major\nArtificial Intelligence 63.6\nComputer Engineering 45.5\nElectrical Engineering 71.0\nElectronic Engineering 92.0\nMechanical Engineering 64.5\nName: Marks, dtype: float64<\/code><\/pre>\n<\/div>\n
Aggregate<\/h2>\n
groups['Marks'].aggregate('mean').round(2)<\/code><\/pre>\n<\/div>\n
groups['Marks'].aggregate(['mean', 'median', 'std']).round(2)<\/code><\/pre>\n<\/div>\n
<\/p>\ngroups.aggregate({'Year_adm': 'median', 'Marks': 'mean'})<\/code><\/pre>\n<\/div>\n
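A closely related pattern is named aggregation, which applies a per-column function and labels the output column at the same time. The sketch below uses made-up values, not the tutorial's random sample:

```python
import pandas as pd

# Hypothetical mini-frame with the tutorial's column names
df = pd.DataFrame({
    "Major": ["EE", "EE", "ME"],
    "Marks": [60, 80, 70],
    "Year_adm": [2019, 2021, 2020],
})

# Each keyword names a result column as (source column, function)
summary = df.groupby("Major").agg(
    avg_marks=("Marks", "mean"),
    first_year=("Year_adm", "min"),
)
print(summary)
# EE: avg_marks 70.0, first_year 2019; ME: avg_marks 70.0, first_year 2020
```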
Transforms<\/h2>\n
def standard_scaler(x):\n    # Group-wise z-score: subtract the group mean, divide by the group std\n    return (x - x.mean()) \/ x.std()\n\ngroups[[\"Marks\", \"Year_adm\", \"num_add_sbj\"]].transform(standard_scaler)<\/code><\/pre>\n<\/div>\n
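To see where those NaN values come from, here is a self-contained sketch with made-up values (independent of the random sample): the single-member group scales to NaN because the sample standard deviation of one value is undefined.

```python
import pandas as pd

def standard_scaler(x):
    # Group-wise z-score: (value - group mean) / group std (ddof=1)
    return (x - x.mean()) / x.std()

df = pd.DataFrame({
    "Major": ["EE", "EE", "ME"],   # "ME" has only a single row
    "Marks": [60.0, 80.0, 70.0],
})
scaled = df.groupby("Major")[["Marks"]].transform(standard_scaler)
print(scaled)
# The EE rows become roughly -0.71 and 0.71; the lone ME row is NaN
```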
<\/p>\nFilter<\/h2>\n