Intermediate Python¶

Texas A&M University High Performance Research Computing

Texas A&M University Engineering Studio for Advanced Instruction and Learning

Texas A&M Engineering Experiment Station Educating Generations

DataFrame Operations¶

Learn how to use the Python Pandas library for data manipulation

Pandas Module¶

Reminder: You should import a module once, usually at the beginning of the notebook.

Execute this cell to bring in pandas and numpy functions.

In [1]:
#don't forget to
import pandas
import numpy

DataFrame Methods¶

DataFrames have many methods for interacting with data.

Reminder: some methods, known as return methods, return a new DataFrame object. They can be chained together.

new_dataframe = dataframe.function().function()...

Reminder: some methods, known as in-place methods, don't return anything. They modify the DataFrame and the same object gets the updates. They cannot be chained together.

dataframe.function()
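For example, many return methods such as rename() accept an inplace=True keyword that turns them into in-place methods. A quick sketch with a made-up two-row DataFrame:

```python
import pandas

df = pandas.DataFrame({"a": [1, 2]}, index=["x", "y"])

# return method: the original df is unchanged, a new object comes back
new_df = df.rename({"x": "X"})
print(list(df.index))      # ['x', 'y']
print(list(new_df.index))  # ['X', 'y']

# in-place method: returns None, df itself is modified
result = df.rename({"x": "X"}, inplace=True)
print(result)              # None
print(list(df.index))      # ['X', 'y']
```

Because in-place methods return None, trying to chain another method after one raises an AttributeError.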

Relabel a row or column¶

Use the df.rename() return method to replace old labels with new labels. You provide a dictionary {} of old and new labels in pairs.

new_df = df.rename({old:new, ...} )
new_df = df.rename({old:new, ...}, axis=1)

To specify whether the dictionary keys are row or column labels, use the keyword argument axis. The axis value can be

  • 0 meaning 'rows'
  • 1 meaning 'columns'

It is 0 by default.

Example 1¶

Using the rename() method twice in a chain.

.rename({"row1":"ROW1"})
.rename({"column1":"COLUMN1"}, axis=1)

Execute the cell to see what happens.

In [2]:
# a quick 3x4 array
data_2d = numpy.arange(12).reshape(3, 4)

df_from_array = pandas.DataFrame(data=data_2d,
                                 index=["row1", "row2", "row3"],
                                 columns=["column1", "column2", "column3", "column4"]
                                 )

# relabeling twice
new_df = df_from_array.rename({"row1":"ROW1"}).rename({"column1":"COLUMN1"}, axis=1)

# display both
display(df_from_array)
print()
display(new_df)
column1 column2 column3 column4
row1 0 1 2 3
row2 4 5 6 7
row3 8 9 10 11

COLUMN1 column2 column3 column4
ROW1 0 1 2 3
row2 4 5 6 7
row3 8 9 10 11

Exercise 1¶

Relabel a different row and column of df_from_array. (Leave the original unchanged).

In [3]:
#your code here

Dropping Entries¶

We can drop either rows or columns in a DataFrame with the drop( ) return method. The argument is one or more labels; for example:

  • a single label
  • list of labels
new_df = df.drop( [label or index, ...] )

To specify whether the provided labels are row or column labels, use the keyword argument axis. The axis value can be

  • 0 meaning 'rows'
  • 1 meaning 'columns'

It is 0 by default.

new_df = df.drop( [label or index, ...], axis=0 )

Example 2¶

Execute the cells to see what happens.

In [4]:
# a quick 3x4 array
data_2d = numpy.arange(12).reshape(3, 4)

df_from_array = pandas.DataFrame(data=data_2d,
                                 index=["row1", "row2", "row3"],
                                 columns=["column1", "column2", "column3", "column4"]
                                 )

Drop a row

In [5]:
new_df = df_from_array.drop('row1', axis=0)
display(new_df)
column1 column2 column3 column4
row2 4 5 6 7
row3 8 9 10 11

Drop two columns

In [6]:
new_df = df_from_array.drop(['column1','column2'], axis=1)
display(new_df)
column3 column4
row1 2 3
row2 6 7
row3 10 11

Exercise 2¶

Create a dataframe (or borrow from above).

Select the last row by index (-1). Retrieve that row's label.

  • Hint: a row is a Series with a property Series.name that is equal to its row label.

Drop that row.

Display the result.

Can you do it all on one line?

In [7]:
#your code here

Click here to see solution

display(df_from_array.drop(df_from_array.iloc[-1].name))

Sorting¶

The DataFrame return methods sort_index() and sort_values() return a new dataframe that is in a different order.

Sorting Labels¶

sort_index() can sort the index; it can also sort the columns.

  • (a default integer index is already sorted, so this is most useful with custom labels).
new_df = df.sort_index()

To specify whether to sort the index or the columns, use the keyword argument axis. The axis value can be

  • 0 meaning 'index'
  • 1 meaning 'columns'

It is 0 by default.

new_df = df.sort_index(axis=1)

To sort the index or the columns forwards or backwards, use the keyword argument ascending. The ascending value can be

  • True meaning sort ascending
  • False meaning sort descending
new_df = df.sort_index(ascending=False)

Example 3¶

Read and execute the cells to see what happens.

Hint: The columns will be in a different order.

In [8]:
temp_dictionary = {
    "Date": ["2023-05-01", "2023-05-02", "2023-05-03", "2023-05-04", "2023-05-05"],
    "Latitude": [34.05, 37.77, 40.71, 34.05, 35.68],
    "Longitude": [-118.25, -122.42, -74.01, -118.25, 139.69],
    "Magnitude": [4.5, 3.8, 5.2, 4.0, 6.1],
    "Depth (km)": [10.0, 12.5, 8.0, 15.0, 20.0],
    "Region": ["City1", "City2", "City3", "City4", "City5"]
}
df_from_lists = pandas.DataFrame(
    temp_dictionary,
    index = ['r1', 'r2', 'r3', 'r4', 'r5']
)
display(df_from_lists)
Date Latitude Longitude Magnitude Depth (km) Region
r1 2023-05-01 34.05 -118.25 4.5 10.0 City1
r2 2023-05-02 37.77 -122.42 3.8 12.5 City2
r3 2023-05-03 40.71 -74.01 5.2 8.0 City3
r4 2023-05-04 34.05 -118.25 4.0 15.0 City4
r5 2023-05-05 35.68 139.69 6.1 20.0 City5
In [9]:
display(df_from_lists.sort_index(axis=1) )
Date Depth (km) Latitude Longitude Magnitude Region
r1 2023-05-01 10.0 34.05 -118.25 4.5 City1
r2 2023-05-02 12.5 37.77 -122.42 3.8 City2
r3 2023-05-03 8.0 40.71 -74.01 5.2 City3
r4 2023-05-04 15.0 34.05 -118.25 4.0 City4
r5 2023-05-05 20.0 35.68 139.69 6.1 City5

Exercise 3¶

Sort the DataFrame df_from_lists from Example 3 by its row labels in descending order.

Display the result.

In [10]:
#your code here

Click here to see solution

display(df_from_lists.sort_index(ascending=False))

Sorting Values¶

sort_values() can sort the rows or the columns based on what is in them.

Specify the row or column whose values determine the order using the required keyword argument by. Provide a row or column label.

new_df = df.sort_values( by=label or index )

To specify whether the provided label is a row or column, use the keyword argument axis. The axis value can be

  • 0 meaning sort the 'rows' by a provided 'column'
  • 1 meaning sort the 'columns' by a provided 'row'

It is 0 by default.

new_df = df.sort_values( by=label, axis=1 )

To sort the values forwards or backwards, use the keyword argument ascending. The ascending value can be

  • True meaning sort ascending
  • False meaning sort descending
new_df = df.sort_values( by=label, ascending=False )
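The worked example below only sorts rows, so here is a small sketch of axis=1 with a made-up DataFrame: the columns are reordered so that the values in row 'r1' are ascending.

```python
import pandas

df = pandas.DataFrame(
    {"c1": [3, 9], "c2": [1, 8], "c3": [2, 7]},
    index=["r1", "r2"],
)

# reorder the columns by the values in row 'r1':
# r1 holds c2=1, c3=2, c1=3, so the columns come back as c2, c3, c1
print(df.sort_values(by="r1", axis=1))
```

The other rows (here 'r2') follow along: their values move with their columns.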

Sorting with Sub-sorting¶

If you wish to sort by multiple rows or columns, with each later one breaking ties in the previous sort (i.e. sub-sorting), provide a list of row or column labels to sort by, in order.

new_df = df.sort_values( by=[label or index, ... ])

If you are sorting multiple rows or columns and you want some of the sorts to be ascending but not others, you can provide a boolean list.

new_df = df.sort_values( by=[label, ...], ascending=[False, ...] )

Example 4¶

Read and execute the cells to see what happens.

Hint: The rows will be in a different order.

In [11]:
temp_dictionary = {
    "Date": ["2023-05-01", "2023-05-02", "2023-05-03", "2023-05-04", "2023-05-05"],
    "Latitude": [34.05, 37.77, 40.71, 34.05, 35.68],
    "Longitude": [-118.25, -122.42, -74.01, -118.25, 139.69],
    "Magnitude": [4.5, 3.8, 5.2, 4.0, 6.1],
    "Depth (km)": [10.0, 12.5, 8.0, 15.0, 20.0],
    "Region": ["City1", "City2", "City3", "City4", "City5"]
}
df_from_lists = pandas.DataFrame(
    temp_dictionary,
    index = ['r1', 'r2', 'r3', 'r4', 'r5']
)
display(df_from_lists)
Date Latitude Longitude Magnitude Depth (km) Region
r1 2023-05-01 34.05 -118.25 4.5 10.0 City1
r2 2023-05-02 37.77 -122.42 3.8 12.5 City2
r3 2023-05-03 40.71 -74.01 5.2 8.0 City3
r4 2023-05-04 34.05 -118.25 4.0 15.0 City4
r5 2023-05-05 35.68 139.69 6.1 20.0 City5
In [12]:
display(df_from_lists.sort_values(by='Magnitude'))
Date Latitude Longitude Magnitude Depth (km) Region
r2 2023-05-02 37.77 -122.42 3.8 12.5 City2
r4 2023-05-04 34.05 -118.25 4.0 15.0 City4
r1 2023-05-01 34.05 -118.25 4.5 10.0 City1
r3 2023-05-03 40.71 -74.01 5.2 8.0 City3
r5 2023-05-05 35.68 139.69 6.1 20.0 City5

Exercise 4¶

Sort the DataFrame df_from_lists from Example 4 with two sort criteria.

  • Column 'Magnitude' as the primary sort (descending)
  • Column 'Depth (km)' as the tie-breaker (ascending).

Display the result.

In [13]:
#your code here

Click here to see solution

display(df_from_lists.sort_values(by=['Magnitude','Depth (km)'], ascending=[False, True]))

Reducing Methods¶

A reducing method takes an array of values and returns a single number. Examples:

Arithmetic:

  • sum()
  • prod()

Statistics:

  • min()
  • max()
  • mean()
  • median()
  • var()
  • std()

Logic:

  • any()
  • all()

Reduce One Column¶

Series objects provide these reducing methods. They return a single value and usually take no arguments. If you only want to operate on a single column, this is the way to go.

value = df[column].function()
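For instance, the logic reductions any() and all() are usually applied to a boolean Series produced by a comparison. A small sketch with made-up magnitudes:

```python
import pandas

magnitudes = pandas.Series([4.5, 3.8, 5.2], index=["r1", "r2", "r3"])

print(magnitudes.max())          # 5.2
print((magnitudes > 5.0).any())  # True: at least one value exceeds 5.0
print((magnitudes > 5.0).all())  # False: not every value exceeds 5.0
```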

Reduce All Columns¶

DataFrame objects also provide these reducing methods. They reduce each column (or each row) to a single value. The return is a Series: its labels are the column labels when reducing each column, or the row labels when reducing each row.

Specify the direction of the reduction using the axis argument.

Example:

  • df.function(axis=0) reduces each column (default)
  • df.function(axis=1) reduces each row

Tip: if some columns contain non-numeric values, use numeric_only=True to skip them.
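A quick sketch of the two directions, using a made-up two-column DataFrame:

```python
import pandas

df = pandas.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (the default): reduce down each column -> one value per column
print(df.sum())        # a: 6, b: 60

# axis=1: reduce across each row -> one value per row
print(df.sum(axis=1))  # 11, 22, 33
```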

Example 5¶

Using reducing function mean().

Execute the cells to see what happens.

In [14]:
temp_dictionary = {
    'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],
    'year' : [2012, 2013, 2014, 2014, 2015],
    'pop' : [5.0, 5.1, 5.2, 4.0, 4.1],
    'unempl' : [5.8, 5.8, 5.8, 4.9, 4.9]
}
df_from_lists = pandas.DataFrame(
    temp_dictionary,
    columns=['year', 'state', 'pop', 'unempl'],
    index = ['r1', 'r2', 'r3', 'r4', 'r5']
)
In [15]:
print(df_from_lists['pop'].mean() )
4.68
In [16]:
print(df_from_lists.mean(numeric_only=True) )
year      2013.60
pop          4.68
unempl       5.44
dtype: float64

Exercise 5¶

Create a DataFrame with two columns of random numeric values.

  • (50 or more values for good results)

Compute the mean and standard deviation of both columns.

  • df.mean()
  • df.std().

Filter the DataFrame by whether the second column is greater than the first.

Compute the mean and standard deviation of both columns again (after filtering).

In [17]:
#your code here

Click here to see solution

random_array = numpy.random.rand(50,2)
random_df = pandas.DataFrame(random_array)
print("     overall")
print()
print("mean")
print(random_df.mean() )
print()
print("std")
print(random_df.std() )
print()
print("     column1 > column0")
print()
mask=random_df[1]>random_df[0]
print("mean")
print(random_df[mask].mean())
print()
print("std")
print(random_df[mask].std())

Overall:

  • The mean is about 0.5 for both columns.
  • The std is about 0.3 for both columns.

After filtering:

  • The mean of the first column decreases to about 0.33
  • The mean of the second column increases to about 0.66
  • The std decreases to about 0.24 for both columns.

Data Grouping¶

Suppose a DataFrame contains a column whose entries are categories.

Category other data
row1 type A 1.000
row2 type B 2.000
row3 type A 3.000
row4 type B 4.000

We may wish to perform operations separately on each category.

  1. Split by category
  2. Operate separately
  3. Recombine

Grouped DataFrames¶

The DataFrame method groupby() can help with this procedure.

  • The argument is a column label (or list thereof)
  • The return is a DataFrameGroupBy object
grouped_df = df.groupby(column)

A DataFrameGroupBy object is like a container for multiple sub-DataFrames.

group label sub-DataFrame
group 1 DataFrame 1
group 2 DataFrame 2
  • The group labels are the unique values found in the groupby argument column.
  • The sub-DataFrames are filtered by the corresponding value.

The DataFrameGroupBy object stores this information in a property .groups, which is a dictionary.

  • grouped_df.groups

You can retrieve a sub-DataFrame using the get_group() method.

sub_df = grouped_df.get_group( group_label )

You can iterate through all the sub-DataFrames; each iteration yields a (group label, sub-DataFrame) pair, much like iterating over a dictionary's items.

for group_label, sub_df in grouped_df:

Example 6¶

Using groupby with the example DataFrame.

Execute the cells to see what happens.

In [18]:
temp_dictionary = {
    'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],
    'year' : [2012, 2013, 2014, 2014, 2015],
    'pop' : [5.0, 5.1, 5.2, 4.0, 4.1],
    'unempl' : [5.8, 5.8, 5.8, 4.9, 4.9]
}
df_from_lists = pandas.DataFrame(
    temp_dictionary,
    columns=['year', 'state', 'pop', 'unempl'],
    index = ['r1', 'r2', 'r3', 'r4', 'r5']
)
display(df_from_lists)
year state pop unempl
r1 2012 VA 5.0 5.8
r2 2013 VA 5.1 5.8
r3 2014 VA 5.2 5.8
r4 2014 MD 4.0 4.9
r5 2015 MD 4.1 4.9
In [19]:
grouped_df = df_from_lists.groupby("state")
print(grouped_df.groups)
{'MD': ['r4', 'r5'], 'VA': ['r1', 'r2', 'r3']}
In [20]:
sub_df = grouped_df.get_group('MD')
display(sub_df)
year state pop unempl
r4 2014 MD 4.0 4.9
r5 2015 MD 4.1 4.9
In [21]:
for group_label, sub_df in grouped_df:
  print(group_label)
  display(sub_df)
  print()
MD
year state pop unempl
r4 2014 MD 4.0 4.9
r5 2015 MD 4.1 4.9
VA
year state pop unempl
r1 2012 VA 5.0 5.8
r2 2013 VA 5.1 5.8
r3 2014 VA 5.2 5.8

Exercise 6¶

Use the DataFrame df_from_lists from Example 6,

  • Group the DataFrame by 'year'.
  • Iterate through the groups and display each sub-DataFrame.
  • Using each group label, retrieve (and print) that group's row labels from the grouped_df.groups dictionary.
In [22]:
#your code here

Click here to see solution

grouped_df = df_from_lists.groupby("year")
for group_label, sub_df in grouped_df:
  print(group_label)
  display(sub_df)
  print()
  print(grouped_df.groups[group_label])
  print()

Grouped DataFrame Operations¶

DataFrameGroupBy has the same columns as the DataFrame, and a column can be selected using square-bracket notation. The result is also a grouped object (a SeriesGroupBy when a single column is selected).

  • grouped_df[ column ]

DataFrameGroupBy has many of the same reducing methods as a DataFrame object. However, they return DataFrames instead of Series.

reduced_df = grouped_df.reducing_method()

The index of the reduced DataFrame is made up of the group labels.
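A small sketch combining the two ideas, with a made-up state/pop DataFrame similar to the examples below: selecting one column of the grouped object and reducing it yields a Series indexed by group label.

```python
import pandas

df = pandas.DataFrame({
    "state": ["VA", "VA", "MD"],
    "pop": [5.0, 5.1, 4.0],
})

grouped_df = df.groupby("state")

# select one column, then reduce: a Series indexed by group label
print(grouped_df["pop"].mean())
# MD    4.00
# VA    5.05
```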

Aggregate Method¶

DataFrameGroupBy has a method named agg() (short for aggregate) that helps with doing multiple reducing operations to multiple columns.

reduced_df = grouped_df.agg( )

The basic argument of agg() is a list of functions or strings. The strings must match the name of a reducing method.

reduced_df = grouped_df.agg( [...] )

Extend the argument of agg() to a dictionary if you want to specify different functions for each column (by label).

reduced_df = grouped_df.agg( {
  column label : [...],
  column label : [...],
  ...
} )

The agg() method will create new columns as necessary to hold all the reductions.

Example 7¶

Computing the mean of each column, grouped by state.

  • Using grouped_df's reducing method mean()
  • Using grouped_df.agg()

Execute the cells to see what happens.

In [23]:
temp_dictionary = {
    'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],
    'year' : [2012, 2013, 2014, 2014, 2015],
    'pop' : [5.0, 5.1, 5.2, 4.0, 4.1],
    'unempl' : [5.8, 5.8, 5.8, 4.9, 4.9]
}
df_from_lists = pandas.DataFrame(
    temp_dictionary,
    columns=['year', 'state', 'pop', 'unempl'],
    index = ['r1', 'r2', 'r3', 'r4', 'r5']
)
grouped_df = df_from_lists.groupby("state")
In [24]:
reduced_df = grouped_df.mean()
display(reduced_df)
year pop unempl
state
MD 2014.5 4.05 4.9
VA 2013.0 5.10 5.8
In [25]:
reduced_df = grouped_df.agg('mean')
display(reduced_df)
year pop unempl
state
MD 2014.5 4.05 4.9
VA 2013.0 5.10 5.8

Exercise 7¶

Starting with the provided national_economics DataFrame, perform the following tasks.

  • Set the index to be the 'Country' column.

  • Create a new column named 'GDP Per Capita', calculated as the ratio of 'GDP (Billions)' to 'Population (Millions)'.

    • (Don't forget to handle the units.)
  • Group the dataframe by 'Continent'.

  • Using the agg() function, perform the following reductions:

    • sum of 'GDP (Billions)'
    • sum of 'Population (Millions)'
    • mean, min, and max of 'GDP Per Capita'
In [26]:
national_economics = pandas.DataFrame({
    'Country': ['United States','China', 'Japan', 'Germany', 'India', 'United Kingdom', 'France', 'Italy', 'Brazil', 'Canada'],
    'Continent': ['America', 'Asia', 'Asia', 'Europe', 'Asia','Europe', 'Europe','Europe', 'America', 'America'],
    'GDP (Billions)': [18624.5, 11218.3, 4936.2, 3477.8, 2259.6, 2647.9, 2465.5, 1858.9, 1795.9, 1529.8],
    'Population (Millions)': [332.9, 1444.2, 126.1, 83.9, 1393.4, 68.2, 65.4, 60.4, 214.0, 38.1],
})

Click here to see solution

# inserting .set_index() here as a method chain
national_economics = pandas.DataFrame({
    'Country': ['United States','China', 'Japan', 'Germany', 'India', 'United Kingdom', 'France', 'Italy', 'Brazil', 'Canada'],
    'Continent': ['America', 'Asia', 'Asia', 'Europe', 'Asia','Europe', 'Europe','Europe', 'America', 'America'],
    'GDP (Billions)': [18624.5, 11218.3, 4936.2, 3477.8, 2259.6, 2647.9, 2465.5, 1858.9, 1795.9, 1529.8],
    'Population (Millions)': [332.9, 1444.2, 126.1, 83.9, 1393.4, 68.2, 65.4, 60.4, 214.0, 38.1],
}).set_index('Country')

# converting from Billions and Millions by multiplication
national_economics['GDP Per Capita'] = (national_economics['GDP (Billions)'] * 1e9) / (national_economics['Population (Millions)'] * 1e6)

# inserting groupby and agg here as a method chain
display(
    national_economics.groupby('Continent').agg({
      'GDP (Billions)':'sum',
      'Population (Millions)':'sum',
      'GDP Per Capita':['mean','min','max']
    })
)