Pandas DataFrame groupby() and agg()

In Pandas, groupby() and agg() work together: groupby() splits the data in a DataFrame into groups based on one or more columns, and agg() then performs aggregation operations on those groups.

After grouping the data using groupby(), you can use agg() to specify one or more functions to be applied to each group of the data. These functions can be built-in aggregation functions, such as mean(), sum(), min(), max(), etc., or custom functions defined by the user.
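
As a minimal sketch of this idea, the snippet below applies one built-in function (given by name) and one custom function to each group. The data and the helper value_range are made up for illustration and are not part of the script further down.

```python
import pandas as pd

# Small illustrative DataFrame (hypothetical data)
df = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "Point": [20, 21, 30, 11],
})

def value_range(series):
    """Custom aggregation: difference between the largest and smallest value."""
    return series.max() - series.min()

# Apply a built-in function (by name) and a custom function to each group
result = df.groupby("ID")["Point"].agg(["sum", value_range])
print(result)
```

Each function produces one column in the result, named after the function, and each group produces one row.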

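The agg() method applies the specified functions to each group and returns a new DataFrame with the aggregated data. By default, the grouping columns become the row index of the result (pass as_index=False to groupby() to keep them as regular columns). When agg() receives a dictionary that maps columns to lists of functions, the columns of the result form a hierarchical index, where the first level corresponds to the column that was aggregated, and the second level corresponds to the function that was applied to it.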
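
The shape of the result can be inspected directly. In the sketch below (illustrative data only), the group keys end up in the row index, and the column labels are pairs of source column and function name. The second call assumes you prefer the grouping column as a regular column and uses as_index=False for that.

```python
import pandas as pd

# Small illustrative DataFrame (hypothetical data)
df = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "Point": [20, 21, 30, 11],
    "Redeemed": [3, 5, 20, 0],
})

spec = {"Point": ["min", "max"], "Redeemed": ["sum"]}

# Default: the group keys become the row index
out = df.groupby("ID").agg(spec)
print(out.index.tolist())    # the group keys
print(out.columns.tolist())  # (source column, function name) pairs

# as_index=False: the grouping column stays a column of the result
flat = df.groupby("ID", as_index=False).agg(spec)
print(flat.columns.tolist())
```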

Overall, groupby() and agg() methods are two powerful tools in Pandas for data grouping and aggregation operations, which can help users extract meaningful insights and information from their data.

Let’s demonstrate how to use the groupby and agg methods in Pandas to perform data aggregation and transformation operations on a DataFrame:

import pandas as pd
import numpy as np

# Custom aggregate function
def transformed_mean(value):
    # Scale a copy of the group's values; `value *= 100` would modify them in place
    return (value * 100).mean()

# Create DataFrame
df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 3, 3], 
    'Name': ['Danny', 'Adil', 'Andi', 'Mala', 'Zack', None], 
    'Point': [20, 21, 30, 11, 10, np.nan], 
    'Redeemed': [3, 5, 20, 0, 1, 1]
})
# Cast to str so that '-'.join works on every value; None becomes the string 'None'
df['Name'] = df['Name'].astype(str)

# groupby() and agg()
df_grp = df.groupby(['ID'], as_index=False).agg({
    'Name': ['-'.join, 'sum', 'count', 'size'], 
    'Point': ['min', 'max', 'mean', 'sum', 'std', 'count', 'size'], 
    'Redeemed': ['sum', lambda x: (x * 2).sum(), transformed_mean]
})

# Replace the hierarchical column labels with flat, descriptive names
df_grp.columns = ['ID', 'Name_join', 'Name_sum', 'Name_count', 'Name_size',
                  'Point_min', 'Point_max', 'Point_mean', 'Point_sum', 'Point_std', 'Point_count', 'Point_size',
                  'Redeemed_sum', 'Redeemed_lambda', 'Redeemed_transformed_mean']

df_grp

Please refer to the following image for a description of the script:

[Image: pandas_groupby_agg_function]

Here’s a summary of what the script does:

  1. A DataFrame df is created from a Python dictionary with the columns ID, Name, Point, and Redeemed. The data contains missing values: np.nan in Point and None in Name. Because Name is cast with astype(str), the None becomes the string 'None'.
  2. A custom function transformed_mean is defined. It multiplies the values of a group by 100 and returns their mean, which shows that agg accepts user-defined functions alongside built-in ones.
  3. The groupby method groups the DataFrame df by the ID column, and as_index=False keeps ID as a regular column in the result. The agg method is then called on the grouped data to aggregate each column with the functions specified for it. The result is saved in a new DataFrame df_grp.
  4. The columns in df_grp are renamed by assigning a list of more descriptive names to the columns attribute.
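
As an alternative to renaming the columns afterwards, pandas also supports named aggregation, where each keyword argument to agg gives the output column name and a (column, function) pair. The sketch below is not part of the original script. It reuses a subset of the same data, the output names are chosen here for illustration, and it also shows how count and size differ when a group contains a missing value.

```python
import numpy as np
import pandas as pd

# Same ID, Point and Redeemed values as in the script above
df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3, 3],
    "Point": [20, 21, 30, 11, 10, np.nan],
    "Redeemed": [3, 5, 20, 0, 1, 1],
})

# Named aggregation: output_name=(source column, function)
summary = df.groupby("ID", as_index=False).agg(
    Point_mean=("Point", "mean"),
    Point_count=("Point", "count"),  # counts non-missing values only
    Point_size=("Point", "size"),    # counts all rows, including NaN
    Redeemed_sum=("Redeemed", "sum"),
    Redeemed_double=("Redeemed", lambda x: (x * 2).sum()),
)
print(summary)
```

For the group with ID 3, Point_count is 1 while Point_size is 2, because one of its two Point values is missing.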

In this script, you saw how to group and aggregate data in a Pandas DataFrame using the groupby and agg methods: an existing DataFrame is grouped by a particular column, agg applies a set of built-in and custom functions to each group, and the columns of the new DataFrame are renamed to make them more descriptive. Together, these methods let you summarize data per group, which makes it easier to draw insights from complex datasets.
