Intermediate Python¶
Texas A&M University High Performance Research Computing
Texas A&M University Engineering Studio for Advanced Instruction and Learning
Texas A&M Engineering Experiment Station Educating Generations
"DataFrame Building"¶
Learn how to use Use Python Pandas library for data manipulation
Pandas Module¶
Reminder: You should import a module once, usually at the beginning of the notebook.
Execute this cell to bring in pandas
and numpy
functions.
import pandas
import numpy
Pandas DataFrame Class¶
Pandas provides the DataFrame
data structure class. Note the capital letters.
print(pandas.DataFrame)
<class 'pandas.core.frame.DataFrame'>
DataFrames are like Spreadsheets¶
The DataFrame
is the 2D extension of the Series
. Both rows and columns have labels.
column1 | column2 | |
---|---|---|
row1 | data |
data |
row2 | data |
data |
Building DataFrames from Data¶
We can build DataFrames using the constructor function DataFrame
which as its first argument accepts an existing data structure.
new_dataframe = pandas.DataFrame( data_structure )
Example 0¶
Just checking that it works.
empty_dataframe = pandas.DataFrame()
print(type(empty_dataframe))
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
DataFrame by Columns¶
We can assemble a DataFrame column-by-column by putting one-dimensional structures (such as lists) into a dictionary.
The keys of the dictionary become the column labels.
temporary_dict = {
column1 : [...],
column2 : [...],
...
}
Example 1¶
Three lists (the same length) become the columns of a DataFrame.
temporary_dict = {
"Date": ["2023-05-01", "2023-05-02", "2023-05-03", "2023-05-04", "2023-05-05"],
"Latitude": [34.05, 37.77, 40.71, 34.05, 35.68],
"Longitude": [-118.25, -122.42, -74.01, -118.25, 139.69],
"Magnitude": [4.1, 3.8, 5.2, 4.0, 6.1],
"Depth (km)": [2.0, 12.5, 8.0, 15.0, 20.0],
"Region": ["City1", "City2", "City3", "City4", "City5"]
}
df_from_lists = pandas.DataFrame(temporary_dict)
print(df_from_lists)
Date Latitude Longitude Magnitude Depth (km) Region 0 2023-05-01 34.05 -118.25 4.1 2.0 City1 1 2023-05-02 37.77 -122.42 3.8 12.5 City2 2 2023-05-03 40.71 -74.01 5.2 8.0 City3 3 2023-05-04 34.05 -118.25 4.0 15.0 City4 4 2023-05-05 35.68 139.69 6.1 20.0 City5
Exercise 1¶
Create two pandas Series
with the same custom index.
pandas.Series([...], index=[...])
Put the two Series
into a temporary dictionary with keys that are column labels.
Create a DataFrame from the temporary dictionary.
Print the DataFrame to verify that the rows have same labels as the original Series
did.
#your code here
Click here to see solution
s1=pandas.Series([1,2,3],index=['a','b','c'])
s2=pandas.Series([4,5,6],index=['a','b','c'])
temporary_dict={"column1":s1,"column2":s2}
df_from_series=pandas.DataFrame(temporary_dict)
print(df_from_series)
DataFrame by Rows¶
We can build a DataFrame row-by-row by putting dictionaries into a one-dimensional data structure (such as a list).
The keys of the dictionaries become the labels of the columns.
temporary_list = [
{ column1: ... , ... },
{ column1: ... , ... },
...
]
Example 2¶
Two dictionaries (with the same keys) become the rows of a DataFrame.
dict_1 = {
"Date": "2023-05-01",
"Latitude": 34.05,
"Longitude": -118.25,
"Magnitude": 4.5,
"Depth (km)": 10.0,
"Region": "City1"
}
dict_2 = {
"Date": "2023-05-02",
"Latitude": 37.77,
"Longitude": -122.42,
"Magnitude": 3.8,
"Depth (km)": 12.5,
"Region": "City2"
}
temporary_list = [dict_1, dict_2]
df_from_dicts = pandas.DataFrame(temporary_list)
print(df_from_dicts)
Date Latitude Longitude Magnitude Depth (km) Region 0 2023-05-01 34.05 -118.25 4.5 10.0 City1 1 2023-05-02 37.77 -122.42 3.8 12.5 City2
Exercise 2¶
Create three dictionaries with some keys in common and some keys not in common.
Put the dictionaries into a temporary list.
Create a DataFrame from the temporary list.
Print the DataFrame and observe:
What happened to the keys that were in common?
What happened to the keys that were not in common?
#your code here
Click here to see solution
d1={'name':'ball' ,'count':3,'color':'red'}
d2={'name':'block','count':2,'weight':1.0}
d3={'name':'plush','count':1,'fabric':'soft'}
dataframe_from_dicts=pandas.DataFrame([d1, d2, d3])
print(dataframe_from_dicts)
The keys that are not in common become columns with NaN
in the positions where the dictionaries did not have that key.
DataFrame from a 2D Array¶
Since 2D NumPy arrays have no labels, we use keyword arguments to set the labels in the DataFrame.
- keyword
data
sets the values of the rows and columns - keyword
index
sets the labels of the rows from a list - keyword
columns
sets the labels of the columns from a list
df_from_array = pandas.DataFrame(data=array, index=[...], columns=[...])
Example 3¶
A NumPy 2D array becomes a DataFrame. The rows and columns remain rows and columns.
# a quick 3x4 array
data_2d = numpy.arange(12).reshape(3, 4)
df_from_array = pandas.DataFrame(data=data_2d,
index=["row1", "row2", "row3"],
columns=["column1", "column2", "column3", "column4"]
)
print(df_from_array)
column1 column2 column3 column4 row1 0 1 2 3 row2 4 5 6 7 row3 8 9 10 11
Exercise 3¶
Create a 2D Array of random numbers using numpy.random.rand()
Create a DataFrame with row and column labels from arithmetic sequences.
Print.
#your code here
Click here to see solution
data_2d = numpy.random.rand(5, 3)
df_from_array = pandas.DataFrame(data=data_2d,
index=range(5,10),
columns=range(96,99)
)
print(df_from_array)
DataFrame from a File¶
Pandas has different reading tools.
For example these tools read spreadsheet files and html:
pandas.read_csv()
pandas.read_excel()
pandas..read_html()
Pandas can read column labels out of spreadsheet files and use those to label a DataFrame.
Example 4¶
Upload the example spreadsheet to your current session. Read it in using
pandas.read_csv(path_to_file)
Save the result in a variable and print the variable.
df_from_file = pandas.read_csv('example_spreadsheet.csv')
print(df_from_file)
Date Latitude Longitude Magnitude Depth (km) Region 0 5/1/23 34.05 -118.25 4.5 10.0 City1 1 5/2/23 37.77 -122.42 3.8 12.5 City2 2 5/3/23 40.71 -74.01 5.2 8.0 City3 3 5/4/23 34.05 -118.25 4.0 15.0 City4 4 5/5/23 35.68 139.69 6.1 20.0 City5
Exercise 4¶
Create a spreadsheet in either csv or excel file format.
Read it in using the appropriate file reading tool. Print.
#your code here
DataFrame to a File¶
Likewise, pandas can write dataframes into spreadsheet files using DataFrame methods
df.to_csv(path_to_file)
df.to_excel(path_to_file)
You must name the file appropriately (excel files end in .xlsx
) to pick a valid file format.
Example 5¶
DataFrame to excel file.
temporary_dict = {
"Date": ["2023-05-01", "2023-05-02", "2023-05-03", "2023-05-04", "2023-05-05"],
"Latitude": [34.05, 37.77, 40.71, 34.05, 35.68],
"Longitude": [-118.25, -122.42, -74.01, -118.25, 139.69],
"Magnitude": [4.5, 3.8, 5.2, 4.0, 6.1],
"Depth (km)": [10.0, 12.5, 8.0, 15.0, 20.0],
"Region": ["City1", "City2", "City3", "City4", "City5"]
}
df_from_lists = pandas.DataFrame(temporary_dict)
df_from_lists.to_csv("example_spreadsheet_2.csv")
Exercise 5¶
Create a DataFrame.
Write to csv file.
Read it back in (a new variable) and print to check whether it is the same.
#your code here
Click here to see solution
temporary_dict = {
"Date": ["2023-05-01", "2023-05-02", "2023-05-03", "2023-05-04", "2023-05-05"],
"Latitude": [34.05, 37.77, 40.71, 34.05, 35.68],
"Longitude": [-118.25, -122.42, -74.01, -118.25, 139.69],
"Magnitude": [4.5, 3.8, 5.2, 4.0, 6.1],
"Depth (km)": [10.0, 12.5, 8.0, 15.0, 20.0],
"Region": ["City1", "City2", "City3", "City4", "City5"]
}
df_from_lists = pandas.DataFrame(temporary_dict)
df_from_lists.to_csv("exercise_spreadsheet.csv")
df_from_csv = pandas.read_csv("exercise_spreadsheet.csv")
print(df_from_csv)
It is not the same because the old index got read in as an unnamed column, and also the new dataframe got a new index.
Fixing DataFrame Columns¶
We have already seen that creating DataFrame from a dictionary will result in columns that have the same labels as the keys of the dictionary.
df = pandas.DataFrame(dictionary)
- However, in that case, we didn't get to pick the order of the columns.
We can specify an order
or subset
of columns by listing column labels in the columns
keyword argument.
df = pandas.DataFrame(dictionary, columns=[...] )
The usecols
argument can be used to select columns when reading from files.
df = pandas.read_<filetype>(path_to_file, usecols=[...] )
Example 6¶
Creating a DataFrame as in Example 1 but with different columns.
temporary_dict = {
"Date": ["2023-05-01", "2023-05-02", "2023-05-03", "2023-05-04", "2023-05-05"],
"Latitude": [34.05, 37.77, 40.71, 34.05, 35.68],
"Longitude": [-118.25, -122.42, -74.01, -118.25, 139.69],
"Magnitude": [4.5, 3.8, 5.2, 4.0, 6.1],
"Depth (km)": [10.0, 12.5, 8.0, 15.0, 20.0],
"Region": ["City1", "City2", "City3", "City4", "City5"]
}
df_from_lists = pandas.DataFrame(temporary_dict, columns=["Date","Magnitude","Region"])
print(df_from_lists)
Date Magnitude Region 0 2023-05-01 4.5 City1 1 2023-05-02 3.8 City2 2 2023-05-03 5.2 City3 3 2023-05-04 4.0 City4 4 2023-05-05 6.1 City5
Exercise 6¶
Read in the example spreadsheet as in Example 4 but specify columns as in Example 6. (Hint: usecols
)
#your code here
Click here to see solution
df_from_file = pandas.read_csv("example_spreadsheet.csv", usecols=["Date","Magnitude","Region"])
print(df_from_file)
Not quite the same. usecols
specifies a subset but not an order.
Fixing DataFrame Index¶
When we created a DataFrame from data, sometimes the intended index
(row labels) gets assigned to one of the columns.
The set_index
DataFrame method can modify a DataFrame to move one of the columns over to use it as the index
. It returns the new DataFrame.
new_df = df.set_index(column_label)
When reading in from a file, we can prevent the unwanted outcome by specifying a column to use as the index:
df = pandas.read_<filetype>(path_to_file, index_col= <column_label>)
Example 7¶
Writing a file and reading it back in causes the index to be misassigned.
The column 'Unnamed: 0'
(position 0 in the file) holds the row labels.
We can fix it using the set_index
method.
Specifying the index as column 0 when reading in prevents that.
# a quick 3x4 array
data_2d = numpy.arange(12).reshape(3, 4)
df_from_array = pandas.DataFrame(data=data_2d,
index=["row1", "row2", "row3"],
columns=["column1", "column2", "column3", "column4"]
)
# creating a file
df_from_array.to_csv("example_7.csv")
# reading back in
df_from_csv = pandas.read_csv("example_7.csv")
print(df_from_csv)
# set_index method
df_from_csv = df_from_csv.set_index("Unnamed: 0")
print(df_from_csv)
# keyword argument method
df_from_csv = pandas.read_csv("example_7.csv", index_col=0)
print(df_from_csv)
Unnamed: 0 column1 column2 column3 column4 0 row1 0 1 2 3 1 row2 4 5 6 7 2 row3 8 9 10 11 column1 column2 column3 column4 Unnamed: 0 row1 0 1 2 3 row2 4 5 6 7 row3 8 9 10 11 column1 column2 column3 column4 row1 0 1 2 3 row2 4 5 6 7 row3 8 9 10 11
The set_index
method keeps the column name intact, so the index has a label just like a column does.
Learn more about that in a future lesson.
Exercise 7¶
Create a DataFrame with the following three arrays as columns (given in the cell below). Don't set an index.
- Hint: put them in a dictionary.
Modify the DataFrame so that the dates column becomes the index. Use a "method chain".
- Hint:
set_index()
is a return method, so it can be chained with other functions that return DataFrames.
Yes, I know you could have just set the index using the index
keyword argument. The point of this exercise is to try something different.
#creating a date range index over 30 days
dates = pandas.date_range('2022-09-01', '2022-09-30')
#creating random prices
price1 = numpy.round(numpy.random.rand(30)*10 + 10, decimals=2)
price2 = numpy.round(numpy.random.rand(30)*10 + 10, decimals=2)
#your code here
Click here to see solution
#creating a date range index over 30 days
dates = pandas.date_range('2022-09-01', '2022-09-30')
#creating random prices
price1 = numpy.round(numpy.random.rand(30)*10 + 10, decimals=2)
price2 = numpy.round(numpy.random.rand(30)*10 + 10, decimals=2)
temporary_dict = {
'dates' : dates,
'price1' : price1,
'price2' : price2
}
df_with_index = pandas.DataFrame(temporary_dict).set_index('dates')
print(df_with_index)