Intermediate Python¶
Texas A&M University High Performance Research Computing
Texas A&M University Engineering Studio for Advanced Instruction and Learning
Texas A&M Engineering Experiment Station Educating Generations
"DataFrame Operations"¶
Learn how to use the Python Pandas library for data manipulation
Pandas Module¶
Reminder: You should import a module once, usually at the beginning of the notebook.
Execute this cell to bring in pandas
and numpy
functions.
#don't forget to
import pandas
import numpy
DataFrame Methods¶
DataFrames have many methods for interacting with data.
Reminder: some methods, known as return methods, return a new DataFrame object. They can be chained together.
new_dataframe = dataframe.function().function()...
Reminder: some methods, known as in-place methods, don't return anything. They modify the DataFrame in place, so the existing object receives the updates. They cannot be chained together.
dataframe.function()
Relabel a row or column¶
Use the df.rename()
return method to replace old labels with new labels. You provide a dictionary {}
of old and new labels in pairs.
new_df = df.rename({old:new, ...} )
new_df = df.rename({old:new, ...}, axis=1)
To specify whether the provided dictionary maps row labels or column labels, use the keyword argument axis
. The axis value can be
- 0 meaning 'rows'
- 1 meaning 'columns'
It is 0 by default.
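A minimal sketch of both forms on a throwaway DataFrame (tiny_df is invented just for this illustration, assuming the pandas import from the top of the notebook). Note that rename() also accepts index= and columns= keyword arguments as an alternative to axis.
# a throwaway 2x2 DataFrame just for this sketch
tiny_df = pandas.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])
# axis=0 (the default) matches row labels; axis=1 matches column labels
display(tiny_df.rename({"x": "X"}).rename({"a": "A"}, axis=1))
# equivalent form using the index= and columns= keywords
display(tiny_df.rename(index={"x": "X"}, columns={"a": "A"}))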
Example 1¶
Using the rename()
method twice in a chain.
.rename({"row1":"ROW1"})
.rename({"column1":"COLUMN1"}, axis=1)
Execute the cell to see what happens.
# a quick 3x4 array
data_2d = numpy.arange(12).reshape(3, 4)
df_from_array = pandas.DataFrame(data=data_2d,
index=["row1", "row2", "row3"],
columns=["column1", "column2", "column3", "column4"]
)
# relabeling twice
new_df = df_from_array.rename({"row1":"ROW1"}).rename({"column1":"COLUMN1"}, axis=1)
# display both
display(df_from_array)
print()
display(new_df)
column1 | column2 | column3 | column4 | |
---|---|---|---|---|
row1 | 0 | 1 | 2 | 3 |
row2 | 4 | 5 | 6 | 7 |
row3 | 8 | 9 | 10 | 11 |
COLUMN1 | column2 | column3 | column4 | |
---|---|---|---|---|
ROW1 | 0 | 1 | 2 | 3 |
row2 | 4 | 5 | 6 | 7 |
row3 | 8 | 9 | 10 | 11 |
Exercise 1¶
Relabel a different row and column of df_from_array
. (Leave the original unchanged).
#your code here
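Click here to see one possible solution
# one possible solution: relabeling 'row2' and 'column2' (an arbitrary choice); the original df_from_array is left unchanged
display(df_from_array.rename({"row2":"ROW2"}).rename({"column2":"COLUMN2"}, axis=1))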
Dropping Entries¶
We can drop either rows or columns in a DataFrame with the drop( )
return method. The argument is one or more labels; for example:
- a single label
- list of labels
new_df = df.drop( [label or index, ...] )
To specify whether the provided labels refer to rows or columns, use the keyword argument axis
. The axis value can be
- 0 meaning 'rows'
- 1 meaning 'columns'
It is 0 by default.
new_df = df.drop( [label or index, ...], axis=0 )
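A minimal sketch on a throwaway DataFrame (tiny_df is invented just for this illustration, assuming the pandas import at the top of the notebook). drop() also accepts index= and columns= keyword arguments, which can read more clearly than axis.
# a throwaway 2x3 DataFrame just for this sketch
tiny_df = pandas.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]}, index=["x", "y"])
# drop a row by label (axis=0 is the default)
display(tiny_df.drop("x"))
# equivalent keyword form: columns= drops columns, index= drops rows
display(tiny_df.drop(columns=["a", "b"]))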
Example 2¶
Execute the cells to see what happens.
# a quick 3x4 array
data_2d = numpy.arange(12).reshape(3, 4)
df_from_array = pandas.DataFrame(data=data_2d,
index=["row1", "row2", "row3"],
columns=["column1", "column2", "column3", "column4"]
)
Drop a row
new_df = df_from_array.drop('row1', axis=0)
display(new_df)
column1 | column2 | column3 | column4 | |
---|---|---|---|---|
row2 | 4 | 5 | 6 | 7 |
row3 | 8 | 9 | 10 | 11 |
Drop two columns
new_df = df_from_array.drop(['column1','column2'], axis=1)
display(new_df)
column3 | column4 | |
---|---|---|
row1 | 2 | 3 |
row2 | 6 | 7 |
row3 | 10 | 11 |
Exercise 2¶
Create a dataframe (or borrow from above).
Select the last row by index (-1). Retrieve that row's label.
- Hint: a row is a Series with a property Series.name that is equal to its row label.
Drop that row.
Display the result.
Can you do it all on one line?
#your code here
Click here to see solution
display(df_from_array.drop(df_from_array.iloc[-1].name))
Sorting¶
The DataFrame return methods sort_index()
and sort_values()
return a new dataframe that is in a different order.
Sorting Labels¶
sort_index()
can sort the index; it can also sort the columns.
- (the default integer index is already sorted).
new_df = df.sort_index()
To specify whether to sort the custom index or the columns, use the keyword argument axis
. The axis value can be
- 0 meaning 'custom index'
- 1 meaning 'columns'
It is 0 by default.
new_df = df.sort_index(axis=1)
To specify whether to sort the custom index or the columns forwards or backwards, use the keyword argument ascending
. The ascending value can be
- True meaning sort ascending
- False meaning sort descending
new_df = df.sort_index(ascending=False)
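The two keyword arguments can be combined; a minimal sketch on a throwaway DataFrame (tiny_df is invented just for this illustration, assuming the pandas import at the top of the notebook).
# a throwaway DataFrame with unsorted row labels
tiny_df = pandas.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r2", "r1"])
# sort the row labels ascending (axis=0 is the default)
display(tiny_df.sort_index())
# sort the column labels in descending order
display(tiny_df.sort_index(axis=1, ascending=False))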
Example 3¶
Read and execute the cells to see what happens.
Hint: The columns will be in a different order.
temp_dictionary = {
"Date": ["2023-05-01", "2023-05-02", "2023-05-03", "2023-05-04", "2023-05-05"],
"Latitude": [34.05, 37.77, 40.71, 34.05, 35.68],
"Longitude": [-118.25, -122.42, -74.01, -118.25, 139.69],
"Magnitude": [4.5, 3.8, 5.2, 4.0, 6.1],
"Depth (km)": [10.0, 12.5, 8.0, 15.0, 20.0],
"Region": ["City1", "City2", "City3", "City4", "City5"]
}
df_from_lists = pandas.DataFrame(
temp_dictionary,
index = ['r1', 'r2', 'r3', 'r4', 'r5']
)
display(df_from_lists)
Date | Latitude | Longitude | Magnitude | Depth (km) | Region | |
---|---|---|---|---|---|---|
r1 | 2023-05-01 | 34.05 | -118.25 | 4.5 | 10.0 | City1 |
r2 | 2023-05-02 | 37.77 | -122.42 | 3.8 | 12.5 | City2 |
r3 | 2023-05-03 | 40.71 | -74.01 | 5.2 | 8.0 | City3 |
r4 | 2023-05-04 | 34.05 | -118.25 | 4.0 | 15.0 | City4 |
r5 | 2023-05-05 | 35.68 | 139.69 | 6.1 | 20.0 | City5 |
display(df_from_lists.sort_index(axis=1) )
Date | Depth (km) | Latitude | Longitude | Magnitude | Region | |
---|---|---|---|---|---|---|
r1 | 2023-05-01 | 10.0 | 34.05 | -118.25 | 4.5 | City1 |
r2 | 2023-05-02 | 12.5 | 37.77 | -122.42 | 3.8 | City2 |
r3 | 2023-05-03 | 8.0 | 40.71 | -74.01 | 5.2 | City3 |
r4 | 2023-05-04 | 15.0 | 34.05 | -118.25 | 4.0 | City4 |
r5 | 2023-05-05 | 20.0 | 35.68 | 139.69 | 6.1 | City5 |
Exercise 3¶
Sort the DataFrame df_from_lists
from Example 3 by its row labels in descending order.
Display the result.
#your code here
Click here to see solution
display(df_from_lists.sort_index(ascending=False))
Sorting Values¶
sort_values()
can sort the rows or the columns based on what is in them.
new_df = df.sort_values()
Specify the row or column that is to be used for the sorting using the keyword argument by
. Provide a row or column label.
new_df = df.sort_values( by=label or index)
To specify whether the provided label is a row or column, use the keyword argument axis
. The axis value can be
- 0 meaning sort the 'rows' by a provided 'column'
- 1 meaning sort the 'columns' by a provided 'row'
It is 0 by default.
new_df = df.sort_values( by=label or index, axis=1 )
To specify whether to sort the values forwards or backwards, use the keyword argument ascending
. The ascending value can be
- True meaning sort ascending
- False meaning sort descending
new_df = df.sort_values( by=label or index, ascending=False )
Sorting with Sub-sorting¶
If you wish to sort by multiple rows or columns, where each additional label acts as a tie-breaker for the one before it (i.e. a sub-sort), you can instead provide a list of row or column labels in priority order.
new_df = df.sort_values( by=[label or index, ... ])
If you are sorting multiple rows or columns and you want some of the sorts to be ascending but not others, you can provide a boolean list.
new_df = df.sort_values( by=[label or index, ...], ascending=[False, ...] )
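A minimal sketch of a sub-sort on a throwaway DataFrame (tiny_df is invented just for this illustration); Example 4 and Exercise 4 below apply the same idea to a real data set.
# a throwaway DataFrame with ties in column 'a'
tiny_df = pandas.DataFrame({"a": [2, 1, 2], "b": [9, 5, 3]})
# sort by 'a' descending, breaking ties with 'b' ascending
display(tiny_df.sort_values(by=["a", "b"], ascending=[False, True]))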
Example 4¶
Read and execute the cells to see what happens.
Hint: The rows will be in a different order.
temp_dictionary = {
"Date": ["2023-05-01", "2023-05-02", "2023-05-03", "2023-05-04", "2023-05-05"],
"Latitude": [34.05, 37.77, 40.71, 34.05, 35.68],
"Longitude": [-118.25, -122.42, -74.01, -118.25, 139.69],
"Magnitude": [4.5, 3.8, 5.2, 4.0, 6.1],
"Depth (km)": [10.0, 12.5, 8.0, 15.0, 20.0],
"Region": ["City1", "City2", "City3", "City4", "City5"]
}
df_from_lists = pandas.DataFrame(
temp_dictionary,
index = ['r1', 'r2', 'r3', 'r4', 'r5']
)
display(df_from_lists)
Date | Latitude | Longitude | Magnitude | Depth (km) | Region | |
---|---|---|---|---|---|---|
r1 | 2023-05-01 | 34.05 | -118.25 | 4.5 | 10.0 | City1 |
r2 | 2023-05-02 | 37.77 | -122.42 | 3.8 | 12.5 | City2 |
r3 | 2023-05-03 | 40.71 | -74.01 | 5.2 | 8.0 | City3 |
r4 | 2023-05-04 | 34.05 | -118.25 | 4.0 | 15.0 | City4 |
r5 | 2023-05-05 | 35.68 | 139.69 | 6.1 | 20.0 | City5 |
display(df_from_lists.sort_values(by='Magnitude'))
Date | Latitude | Longitude | Magnitude | Depth (km) | Region | |
---|---|---|---|---|---|---|
r2 | 2023-05-02 | 37.77 | -122.42 | 3.8 | 12.5 | City2 |
r4 | 2023-05-04 | 34.05 | -118.25 | 4.0 | 15.0 | City4 |
r1 | 2023-05-01 | 34.05 | -118.25 | 4.5 | 10.0 | City1 |
r3 | 2023-05-03 | 40.71 | -74.01 | 5.2 | 8.0 | City3 |
r5 | 2023-05-05 | 35.68 | 139.69 | 6.1 | 20.0 | City5 |
Exercise 4¶
Sort the DataFrame df_from_lists
from Example 4 with two sort criteria.
- Column 'Magnitude' as the primary sort (descending)
- Column 'Depth (km)' as the tie-breaker (ascending)
Display the result.
#your code here
Click here to see solution
display(df_from_lists.sort_values(by=['Magnitude','Depth (km)'], ascending=[False, True]))
Reducing Methods¶
A reducing method takes an array of values and reduces it to a single value. Examples:
Arithmetic:
sum()
prod()
Statistics:
min()
max()
mean()
median()
var()
std()
Logic:
any()
all()
Reduce One Column¶
Series
objects provide these reducing methods. They just return a single value. They usually don't have an argument. If you only want to operate on a single column, this is the way to go.
value = df[column].function()
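A minimal sketch on a throwaway DataFrame (tiny_df is invented just for this illustration, assuming the pandas import at the top of the notebook).
# a throwaway DataFrame just for this sketch
tiny_df = pandas.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
# each call reduces one column (a Series) to a single value
print(tiny_df["a"].sum())    # 6
print(tiny_df["a"].mean())   # 2.0
print(tiny_df["b"].max())    # 6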
Reduce All Columns¶
DataFrame
objects also provide these reducing methods. They operate on each row or each column. The return is a Series
where the labels are column labels.
Specify whether to compute row-by-row or column-by-column using the axis
argument.
Example:
- df.function(axis=0) reduces each column, combining values down the rows (this is the default)
- df.function(axis=1) reduces each row, combining values across the columns
Tip: if some columns contain non-numeric values, use numeric_only=True
so that those columns are skipped.
temp_dictionary = {
'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],
'year' : [2012, 2013, 2014, 2014, 2015],
'pop' : [5.0, 5.1, 5.2, 4.0, 4.1],
'unempl' : [5.8, 5.8, 5.8, 4.9, 4.9]
}
df_from_lists = pandas.DataFrame(
temp_dictionary,
columns=['year', 'state', 'pop', 'unempl'],
index = ['r1', 'r2', 'r3', 'r4', 'r5']
)
print(df_from_lists['pop'].mean() )
4.68
print(df_from_lists.mean(numeric_only=True) )
year      2013.60
pop          4.68
unempl       5.44
dtype: float64
Exercise 5¶
Create a DataFrame with two columns of random numeric values.
- (50 or more values for good results)
Compute the mean and standard deviation of both columns using df.mean() and df.std().
Filter the DataFrame by whether the second column is greater than the first.
Compute the mean and standard deviation of both columns again (after filtering).
#your code here
Click here to see solution
random_array = numpy.random.rand(50,2)
random_df = pandas.DataFrame(random_array)
print(" overall")
print()
print("mean")
print(random_df.mean() )
print()
print("std")
print(random_df.std() )
print()
print(" column1 > column0")
print()
# boolean mask: True where the second column exceeds the first
mask = random_df[1] > random_df[0]
print("mean")
print(random_df[mask].mean())
print()
print("std")
print(random_df[mask].std())
Overall:
- The mean is about 0.5 for both columns.
- The std is about 0.3 for both columns.
After filtering:
- The mean of the first column decreases to about 0.33.
- The mean of the second column increases to about 0.66.
- The std decreases to about 0.24 for both columns.
Data Grouping¶
Suppose a DataFrame contains a column whose entries are categories.
Category | other data | |
---|---|---|
row1 | type A | 1.000 |
row2 | type B | 2.000 |
row3 | type A | 3.000 |
row4 | type B | 4.000 |
We may wish to perform operations separately on each category.
- Split by category
- Operate separately
- Recombine
Grouped DataFrames¶
The DataFrame method groupby()
can help with this procedure.
- The argument is a column label (or list thereof)
- The return is a DataFrameGroupBy object
grouped_df = df.groupby(column)
A DataFrameGroupBy object is like a container for multiple sub-DataFrames.
group label | sub-DataFrame |
---|---|
group 1 | DataFrame 1 |
group 2 | DataFrame 2 |
- The group labels are the unique values found in the groupby argument column.
- The sub-DataFrames are filtered by the corresponding value.
The DataFrameGroupBy object stores this information in a property .groups, which is a dictionary.
grouped_df.groups
You can retrieve a sub-DataFrame using the get_group()
method.
sub_df = grouped_df.get_group( group_label )
You can iterate through all the sub-DataFrames as though the DataFrameGroupBy
object were a dictionary of (group label, sub-DataFrame) pairs.
for group_label, sub_df in grouped_df:
temp_dictionary = {
'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],
'year' : [2012, 2013, 2014, 2014, 2015],
'pop' : [5.0, 5.1, 5.2, 4.0, 4.1],
'unempl' : [5.8, 5.8, 5.8, 4.9, 4.9]
}
df_from_lists = pandas.DataFrame(
temp_dictionary,
columns=['year', 'state', 'pop', 'unempl'],
index = ['r1', 'r2', 'r3', 'r4', 'r5']
)
display(df_from_lists)
year | state | pop | unempl | |
---|---|---|---|---|
r1 | 2012 | VA | 5.0 | 5.8 |
r2 | 2013 | VA | 5.1 | 5.8 |
r3 | 2014 | VA | 5.2 | 5.8 |
r4 | 2014 | MD | 4.0 | 4.9 |
r5 | 2015 | MD | 4.1 | 4.9 |
grouped_df = df_from_lists.groupby("state")
print(grouped_df.groups)
{'MD': ['r4', 'r5'], 'VA': ['r1', 'r2', 'r3']}
sub_df = grouped_df.get_group('MD')
display(sub_df)
year | state | pop | unempl | |
---|---|---|---|---|
r4 | 2014 | MD | 4.0 | 4.9 |
r5 | 2015 | MD | 4.1 | 4.9 |
for group_label, sub_df in grouped_df:
print(group_label)
display(sub_df)
print()
MD
year | state | pop | unempl | |
---|---|---|---|---|
r4 | 2014 | MD | 4.0 | 4.9 |
r5 | 2015 | MD | 4.1 | 4.9 |
VA
year | state | pop | unempl | |
---|---|---|---|---|
r1 | 2012 | VA | 5.0 | 5.8 |
r2 | 2013 | VA | 5.1 | 5.8 |
r3 | 2014 | VA | 5.2 | 5.8 |
Exercise 6¶
Use the DataFrame df_from_lists from the example above.
- Group the DataFrame by 'year'.
- Iterate through the groups and display each sub-DataFrame.
- Using each group label, retrieve (and print) the rows for that group from the grouped_df.groups dictionary.
#your code here
Click here to see solution
grouped_df = df_from_lists.groupby("year")
for group_label, sub_df in grouped_df:
print(group_label)
display(sub_df)
print()
print(grouped_df.groups[group_label])
print()
Grouped DataFrame Operations¶
A DataFrameGroupBy object has the same columns as the DataFrame, and they can be selected using slice notation. The result is another grouped object: a single label gives a SeriesGroupBy, while a list of labels gives a DataFrameGroupBy.
grouped_df[ column ]
DataFrameGroupBy
has many of the same reducing methods as a DataFrame
object. However, they return DataFrames
instead of Series
.
reduced_df = grouped_df.reducing_method()
The index
of the reduced DataFrame consists of the former group labels.
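A minimal sketch reusing df_from_lists from the grouping example above (assuming those cells have been run).
# re-create the grouping so this cell stands on its own
grouped_df = df_from_lists.groupby("state")
# selecting one column gives a grouped Series; reducing it returns a Series indexed by group
print(grouped_df["pop"].mean())
# reducing the whole grouped object returns a DataFrame indexed by group
display(grouped_df.mean(numeric_only=True))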
Aggregate Method¶
DataFrameGroupBy
has a method named agg()
(short for aggregate) that helps apply multiple reducing operations to multiple columns.
reduced_df = grouped_df.agg( )
The basic argument of agg()
is a function or string, or a list of functions or strings. Strings must match the name of a reducing method.
reduced_df = grouped_df.agg( [...] )
Extend the argument of agg()
to a dictionary if you want to specify different functions for each column (by label).
reduced_df = grouped_df.agg( {
column label : [...],
column label : [...],
...
} )
The agg()
method will create new columns as necessary to hold all the reductions.
Example 7¶
Computing the mean of each column, grouped by state.
- Using grouped_df's reducing method mean()
- Using grouped_df.agg()
Execute the cells to see what happens.
temp_dictionary = {
'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],
'year' : [2012, 2013, 2014, 2014, 2015],
'pop' : [5.0, 5.1, 5.2, 4.0, 4.1],
'unempl' : [5.8, 5.8, 5.8, 4.9, 4.9]
}
df_from_lists = pandas.DataFrame(
temp_dictionary,
columns=['year', 'state', 'pop', 'unempl'],
index = ['r1', 'r2', 'r3', 'r4', 'r5']
)
grouped_df = df_from_lists.groupby("state")
reduced_df = grouped_df.mean()
display(reduced_df)
year | pop | unempl | |
---|---|---|---|
state | |||
MD | 2014.5 | 4.05 | 4.9 |
VA | 2013.0 | 5.10 | 5.8 |
reduced_df = grouped_df.agg('mean')
display(reduced_df)
year | pop | unempl | |
---|---|---|---|
state | |||
MD | 2014.5 | 4.05 | 4.9 |
VA | 2013.0 | 5.10 | 5.8 |
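Passing a list to agg() produces one sub-column per reduction; a quick sketch reusing grouped_df from Example 7 (the multi-level column header is pandas' default layout for this case).
# one row per group, one sub-column for each requested reduction
display(grouped_df.agg(['mean', 'max']))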
Exercise 7¶
Starting with the provided national_economics
DataFrame, perform the following tasks.
Set the
index
to be the'Country'
column.Create a new column named
'GDP Per Capita'
which is calculate to be the ratio of'GDP'
and'Population'
.- (Don't forget to handle the units.)
Group the dataframe by
'Continent'
.Using the
agg()
function, perform the following reductions:sum
of'GDP (Billions)'
sum
of'Population (Millions)'
mean
,min
, andmax
of'GDP Per Capita'
national_economics = pandas.DataFrame({
'Country': ['United States','China', 'Japan', 'Germany', 'India', 'United Kingdom', 'France', 'Italy', 'Brazil', 'Canada'],
'Continent': ['America', 'Asia', 'Asia', 'Europe', 'Asia','Europe', 'Europe','Europe', 'America', 'America'],
'GDP (Billions)': [18624.5, 11218.3, 4936.2, 3477.8, 2259.6, 2647.9, 2465.5, 1858.9, 1795.9, 1529.8],
'Population (Millions)': [332.9, 1444.2, 126.1, 83.9, 1393.4, 68.2, 65.4, 60.4, 214.0, 38.1],
})
Click here to see solution
# inserting .set_index() here as a method chain
national_economics = pandas.DataFrame({
'Country': ['United States','China', 'Japan', 'Germany', 'India', 'United Kingdom', 'France', 'Italy', 'Brazil', 'Canada'],
'Continent': ['America', 'Asia', 'Asia', 'Europe', 'Asia','Europe', 'Europe','Europe', 'America', 'America'],
'GDP (Billions)': [18624.5, 11218.3, 4936.2, 3477.8, 2259.6, 2647.9, 2465.5, 1858.9, 1795.9, 1529.8],
'Population (Millions)': [332.9, 1444.2, 126.1, 83.9, 1393.4, 68.2, 65.4, 60.4, 214.0, 38.1],
}).set_index('Country')
# converting from Billions and Millions by multiplication
national_economics['GDP Per Capita'] = (national_economics['GDP (Billions)'] * 1e9) / (national_economics['Population (Millions)'] * 1e6)
# inserting groupby and agg here as a method chain
display(
national_economics.groupby('Continent').agg({
'GDP (Billions)':'sum',
'Population (Millions)':'sum',
'GDP Per Capita':['mean','min','max']
})
)