Aggregating
Adding and Modifying Columns
Adding a new column to a dataframe using assign
dogs = pd.read_csv(Path('data') / 'dogs43.csv', index_col='breed')
dogs.assign(cost_per_year=dogs['lifetime_cost'] / dogs['longevity'])
Chain methods together instead of writing long, hard-to-read lines
- Need to wrap expression in parentheses to add newlines before every method call
(dogs
.assign(cost_per_year=dogs['lifetime_cost'] / dogs['longevity'])
.sort_values('cost_per_year')
.iloc[:5]
)
Assign with special column names (spaces, special characters)
dogs.assign(**{'cost per year 💵': dogs['lifetime_cost'] / dogs['longevity']})
df.copy()
copies a dataframe in place
df[] =
assigns column in place
df.assign()
assigns a column to a new dataframe
Avoid inplace=True
- plans to remove in future releases of pandas, not good practice
df[column].to_numpy()
returns the numpy array of a column
dogs.max(axis=1)
won’t work because you are trying to take the max of a mix of datatypes
Data Granularity and the groupby
method
Fine granularity: small details
Coarse: bigger picture
You should opt for finer granularity for more detail if you have the resources to do so
How to go from fine to coarse granularity: Aggregating
Aggregation
Aggregating is the act of combining many values into a single value
penguins.groupby('species')['body_mass_g'].mean()
“Split-apply-combine” Paradigm
(penguins
.assign(is_dream = penguins['island'] == 'Dream')
.groupby('species')
['is_dream']
.mean()
)
%%pt
penguins_small.groupby('species')
Allows us to visualize groupby_objects
Aggregation Methods
count()
sum()
mean()
max()
last()
first()
(penguins
.sort_values('body_mass_g')
.groupby('sex')
.last()
)
Generally, you should select column(s) directly after groupby
Beyond default aggregation methods
(penguins
.groupby('species')
['body_mass_g']
.aggregate(['count', 'mean'])
)
(penguins
.groupby('species')
.aggregate({'bill_length_mm': 'max', 'island': 'unique'})
)
def iqr(s):
return np.percntile(s, 75) - np.percentile(s, 25)
(penguins
.groupby('species')
['body_mass_g']
.agg(iqr) # agg is short for .aggregate
)