Intermediate Python¶

Texas A&M University High Performance Research Computing

Texas A&M University Engineering Studio for Advanced Instruction and Learning

Texas A&M Engineering Experiment Station Educating Generations

"DataFrame Building"¶

Learn how to use Use Python Pandas library for data manipulation

Pandas Module¶

Reminder: You should import a module once, usually at the beginning of the notebook.

Execute this cell to bring in pandas and numpy functions.

In [1]:
import pandas
import numpy

Pandas DataFrame Class¶

Pandas provides the DataFrame data structure class. Note the capital letters.

In [2]:
print(pandas.DataFrame)
<class 'pandas.core.frame.DataFrame'>

DataFrames are like Spreadsheets¶

The DataFrame is the 2D extension of the Series. Both rows and columns have labels.

column1 column2
row1 data data
row2 data data

Building DataFrames from Data¶

We can build DataFrames using the constructor function DataFrame which as its first argument accepts an existing data structure.

new_dataframe = pandas.DataFrame( data_structure )

Example 0¶

Just checking that it works.

In [3]:
empty_dataframe = pandas.DataFrame()
print(type(empty_dataframe))
<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

DataFrame by Columns¶

We can assemble a DataFrame column-by-column by putting one-dimensional structures (such as lists) into a dictionary.

The keys of the dictionary become the column labels.

temporary_dict = {
  column1 : [...],
  column2 : [...],
  ...
}

Example 1¶

Three lists (the same length) become the columns of a DataFrame.

In [4]:
temporary_dict = {
    "Date": ["2023-05-01", "2023-05-02", "2023-05-03", "2023-05-04", "2023-05-05"],
    "Latitude": [34.05, 37.77, 40.71, 34.05, 35.68],
    "Longitude": [-118.25, -122.42, -74.01, -118.25, 139.69],
    "Magnitude": [4.1, 3.8, 5.2, 4.0, 6.1],
    "Depth (km)": [2.0, 12.5, 8.0, 15.0, 20.0],
    "Region": ["City1", "City2", "City3", "City4", "City5"]
}
df_from_lists = pandas.DataFrame(temporary_dict)
print(df_from_lists)
         Date  Latitude  Longitude  Magnitude  Depth (km) Region
0  2023-05-01     34.05    -118.25        4.1         2.0  City1
1  2023-05-02     37.77    -122.42        3.8        12.5  City2
2  2023-05-03     40.71     -74.01        5.2         8.0  City3
3  2023-05-04     34.05    -118.25        4.0        15.0  City4
4  2023-05-05     35.68     139.69        6.1        20.0  City5

Exercise 1¶

Create two pandas Series with the same custom index.

pandas.Series([...], index=[...])

Put the two Series into a temporary dictionary with keys that are column labels.

Create a DataFrame from the temporary dictionary.

Print the DataFrame to verify that the rows have same labels as the original Series did.

In [5]:
#your code here

Click here to see solution

s1=pandas.Series([1,2,3],index=['a','b','c'])
s2=pandas.Series([4,5,6],index=['a','b','c'])
temporary_dict={"column1":s1,"column2":s2}
df_from_series=pandas.DataFrame(temporary_dict)
print(df_from_series)

DataFrame by Rows¶

We can build a DataFrame row-by-row by putting dictionaries into a one-dimensional data structure (such as a list).

The keys of the dictionaries become the labels of the columns.

temporary_list = [
  { column1: ... , ... },
  { column1: ... , ... },
  ...
]

Example 2¶

Two dictionaries (with the same keys) become the rows of a DataFrame.

In [6]:
dict_1 = {
  "Date": "2023-05-01",
  "Latitude": 34.05,
  "Longitude": -118.25,
  "Magnitude": 4.5,
  "Depth (km)": 10.0,
  "Region": "City1"
}

dict_2 = {
  "Date": "2023-05-02",
  "Latitude": 37.77,
  "Longitude": -122.42,
  "Magnitude": 3.8,
  "Depth (km)": 12.5,
  "Region": "City2"
}

temporary_list = [dict_1, dict_2]

df_from_dicts = pandas.DataFrame(temporary_list)

print(df_from_dicts)
         Date  Latitude  Longitude  Magnitude  Depth (km) Region
0  2023-05-01     34.05    -118.25        4.5        10.0  City1
1  2023-05-02     37.77    -122.42        3.8        12.5  City2

Exercise 2¶

Create three dictionaries with some keys in common and some keys not in common.

Put the dictionaries into a temporary list.

Create a DataFrame from the temporary list.

Print the DataFrame and observe:

What happened to the keys that were in common?

What happened to the keys that were not in common?

In [7]:
#your code here

Click here to see solution

d1={'name':'ball' ,'count':3,'color':'red'}
d2={'name':'block','count':2,'weight':1.0}
d3={'name':'plush','count':1,'fabric':'soft'}

dataframe_from_dicts=pandas.DataFrame([d1, d2, d3])

print(dataframe_from_dicts)

The keys that are not in common become columns with NaN in the positions where the dictionaries did not have that key.

DataFrame from a 2D Array¶

Since 2D NumPy arrays have no labels, we use keyword arguments to set the labels in the DataFrame.

  • keyword data sets the values of the rows and columns
  • keyword index sets the labels of the rows from a list
  • keyword columns sets the labels of the columns from a list
df_from_array = pandas.DataFrame(data=array, index=[...], columns=[...])

Example 3¶

A NumPy 2D array becomes a DataFrame. The rows and columns remain rows and columns.

In [8]:
# a quick 3x4 array
data_2d = numpy.arange(12).reshape(3, 4)

df_from_array = pandas.DataFrame(data=data_2d,
                                 index=["row1", "row2", "row3"],
                                 columns=["column1", "column2", "column3", "column4"]
                                 )
print(df_from_array)
      column1  column2  column3  column4
row1        0        1        2        3
row2        4        5        6        7
row3        8        9       10       11

Exercise 3¶

Create a 2D Array of random numbers using numpy.random.rand()

Create a DataFrame with row and column labels from arithmetic sequences.

Print.

In [9]:
#your code here

Click here to see solution

data_2d = numpy.random.rand(5, 3)

df_from_array = pandas.DataFrame(data=data_2d,
                                 index=range(5,10),
                                 columns=range(96,99)
                                 )
print(df_from_array)

DataFrame from a File¶

Pandas has different reading tools.

For example these tools read spreadsheet files and html:

pandas.read_csv()
pandas.read_excel()
pandas..read_html()

Pandas can read column labels out of spreadsheet files and use those to label a DataFrame.

Example 4¶

Upload the example spreadsheet to your current session. Read it in using

pandas.read_csv(path_to_file)

Save the result in a variable and print the variable.

In [10]:
df_from_file = pandas.read_csv('example_spreadsheet.csv')
print(df_from_file)
     Date  Latitude  Longitude  Magnitude  Depth (km) Region
0  5/1/23     34.05    -118.25        4.5        10.0  City1
1  5/2/23     37.77    -122.42        3.8        12.5  City2
2  5/3/23     40.71     -74.01        5.2         8.0  City3
3  5/4/23     34.05    -118.25        4.0        15.0  City4
4  5/5/23     35.68     139.69        6.1        20.0  City5

Exercise 4¶

Create a spreadsheet in either csv or excel file format.

Read it in using the appropriate file reading tool. Print.

In [11]:
#your code here

DataFrame to a File¶

Likewise, pandas can write dataframes into spreadsheet files using DataFrame methods

df.to_csv(path_to_file)
df.to_excel(path_to_file)

You must name the file appropriately (excel files end in .xlsx) to pick a valid file format.

Example 5¶

DataFrame to excel file.

In [12]:
temporary_dict = {
    "Date": ["2023-05-01", "2023-05-02", "2023-05-03", "2023-05-04", "2023-05-05"],
    "Latitude": [34.05, 37.77, 40.71, 34.05, 35.68],
    "Longitude": [-118.25, -122.42, -74.01, -118.25, 139.69],
    "Magnitude": [4.5, 3.8, 5.2, 4.0, 6.1],
    "Depth (km)": [10.0, 12.5, 8.0, 15.0, 20.0],
    "Region": ["City1", "City2", "City3", "City4", "City5"]
}
df_from_lists = pandas.DataFrame(temporary_dict)
df_from_lists.to_csv("example_spreadsheet_2.csv")

Exercise 5¶

Create a DataFrame.

Write to csv file.

Read it back in (a new variable) and print to check whether it is the same.

In [13]:
#your code here

Click here to see solution

temporary_dict = {
    "Date": ["2023-05-01", "2023-05-02", "2023-05-03", "2023-05-04", "2023-05-05"],
    "Latitude": [34.05, 37.77, 40.71, 34.05, 35.68],
    "Longitude": [-118.25, -122.42, -74.01, -118.25, 139.69],
    "Magnitude": [4.5, 3.8, 5.2, 4.0, 6.1],
    "Depth (km)": [10.0, 12.5, 8.0, 15.0, 20.0],
    "Region": ["City1", "City2", "City3", "City4", "City5"]
}
df_from_lists = pandas.DataFrame(temporary_dict)
df_from_lists.to_csv("exercise_spreadsheet.csv")

df_from_csv = pandas.read_csv("exercise_spreadsheet.csv")
print(df_from_csv)

It is not the same because the old index got read in as an unnamed column, and also the new dataframe got a new index.

Fixing DataFrame Columns¶

We have already seen that creating DataFrame from a dictionary will result in columns that have the same labels as the keys of the dictionary.

df = pandas.DataFrame(dictionary)
  • However, in that case, we didn't get to pick the order of the columns.

We can specify an order or subset of columns by listing column labels in the columns keyword argument.

df = pandas.DataFrame(dictionary, columns=[...] )

The usecols argument can be used to select columns when reading from files.

df = pandas.read_<filetype>(path_to_file, usecols=[...] )

Example 6¶

Creating a DataFrame as in Example 1 but with different columns.

In [14]:
temporary_dict = {
    "Date": ["2023-05-01", "2023-05-02", "2023-05-03", "2023-05-04", "2023-05-05"],
    "Latitude": [34.05, 37.77, 40.71, 34.05, 35.68],
    "Longitude": [-118.25, -122.42, -74.01, -118.25, 139.69],
    "Magnitude": [4.5, 3.8, 5.2, 4.0, 6.1],
    "Depth (km)": [10.0, 12.5, 8.0, 15.0, 20.0],
    "Region": ["City1", "City2", "City3", "City4", "City5"]
}

df_from_lists = pandas.DataFrame(temporary_dict, columns=["Date","Magnitude","Region"])

print(df_from_lists)
         Date  Magnitude Region
0  2023-05-01        4.5  City1
1  2023-05-02        3.8  City2
2  2023-05-03        5.2  City3
3  2023-05-04        4.0  City4
4  2023-05-05        6.1  City5

Exercise 6¶

Read in the example spreadsheet as in Example 4 but specify columns as in Example 6. (Hint: usecols)

In [15]:
#your code here

Click here to see solution

df_from_file = pandas.read_csv("example_spreadsheet.csv", usecols=["Date","Magnitude","Region"])
print(df_from_file)

Not quite the same. usecols specifies a subset but not an order.

Fixing DataFrame Index¶

When we created a DataFrame from data, sometimes the intended index (row labels) gets assigned to one of the columns.

The set_index DataFrame method can modify a DataFrame to move one of the columns over to use it as the index. It returns the new DataFrame.

new_df = df.set_index(column_label)

When reading in from a file, we can prevent the unwanted outcome by specifying a column to use as the index:

df = pandas.read_<filetype>(path_to_file, index_col= <column_label>)

Example 7¶

Writing a file and reading it back in causes the index to be misassigned.

The column 'Unnamed: 0' (position 0 in the file) holds the row labels.

We can fix it using the set_index method.

Specifying the index as column 0 when reading in prevents that.

In [16]:
# a quick 3x4 array
data_2d = numpy.arange(12).reshape(3, 4)

df_from_array = pandas.DataFrame(data=data_2d,
                                 index=["row1", "row2", "row3"],
                                 columns=["column1", "column2", "column3", "column4"]
                                 )
# creating a file
df_from_array.to_csv("example_7.csv")

# reading back in
df_from_csv = pandas.read_csv("example_7.csv")
print(df_from_csv)

# set_index method
df_from_csv = df_from_csv.set_index("Unnamed: 0")
print(df_from_csv)

# keyword argument method
df_from_csv = pandas.read_csv("example_7.csv", index_col=0)
print(df_from_csv)
  Unnamed: 0  column1  column2  column3  column4
0       row1        0        1        2        3
1       row2        4        5        6        7
2       row3        8        9       10       11
            column1  column2  column3  column4
Unnamed: 0                                    
row1              0        1        2        3
row2              4        5        6        7
row3              8        9       10       11
      column1  column2  column3  column4
row1        0        1        2        3
row2        4        5        6        7
row3        8        9       10       11

The set_index method keeps the column name intact, so the index has a label just like a column does.

Learn more about that in a future lesson.

Exercise 7¶

Create a DataFrame with the following three arrays as columns (given in the cell below). Don't set an index.

  • Hint: put them in a dictionary.

Modify the DataFrame so that the dates column becomes the index. Use a "method chain".

  • Hint: set_index() is a return method, so it can be chained with other functions that return DataFrames.

Yes, I know you could have just set the index using the index keyword argument. The point of this exercise is to try something different.

In [17]:
#creating a date range index over 30 days
dates = pandas.date_range('2022-09-01', '2022-09-30')
#creating random prices
price1 = numpy.round(numpy.random.rand(30)*10 + 10, decimals=2)
price2 = numpy.round(numpy.random.rand(30)*10 + 10, decimals=2)
#your code here

Click here to see solution

#creating a date range index over 30 days
dates = pandas.date_range('2022-09-01', '2022-09-30')
#creating random prices
price1 = numpy.round(numpy.random.rand(30)*10 + 10, decimals=2)
price2 = numpy.round(numpy.random.rand(30)*10 + 10, decimals=2)

temporary_dict = {
    'dates' : dates,
    'price1' : price1,
    'price2' : price2
}

df_with_index = pandas.DataFrame(temporary_dict).set_index('dates')

print(df_with_index)

In [ ]: