Okay, okay, the title might be a little sensationalised. I have been using the R statistics package to process the results of evolutionary runs since beginning my PhD two years ago. In that time, I have become familiar with the basic process of importing data, computing basic population statistics (means, confidence intervals, etc.), and plotting with ggplot. I’ve always felt that I could streamline the process, though, as I do a great deal of preprocessing in Python. This typically involves combining multiple replicate runs into one data file and possibly even computing some basic statistics using Python's built-in functionality.
Even as a dedicated Python user, I still appreciate the data frames that R provides and the ability to work with data through a query-style interface. Although some of R's behaviour seems strange to me as a programmer, I put up with it to work with my data. Recently, I came across Pandas, the Python Data Analysis Library. The library offers functionality similar to R-style data frames, and with that I was hooked. Of course, switching over was not such an easy task, and I spent my fair share of time searching through StackOverflow posts. In this article, I’ll present some of the features that I found immediately useful when moving from R to Python/Pandas for my data processing.
Getting Started with Pandas:
Getting a quick start with pandas is extremely easy. The following code snippet imports the library, creates an empty data frame, and creates a second data frame with a single column populated with three values. Appending data to a data frame is a little strange, as you have to assign the result to a variable rather than having the frame modified in place. Older releases provided DataFrame.append() for this; it was removed in pandas 2.0, so the snippet uses pandas.concat(), which likewise returns a new data frame.

    import pandas

    # Create an empty dataframe
    df = pandas.DataFrame()

    # Create a second dataframe with some simple data.
    data = pandas.DataFrame({"A": range(3)})

    # Append data to the empty dataframe. concat() returns a new
    # dataframe, so the result has to be assigned.
    df = pandas.concat([df, data])

    # Print out the result.
    print(df)
This prints:

       A
    0  0
    1  1
    2  2
Adding a column to a data frame:
A common operation for processing data involves appending a single value for all rows in a data frame such as denoting a Trial number. This is done with the following command:
    # Append a Trial column to the dataframe for multiple replicates.
    df['Trial'] = 1
    print(df)

This prints:

       A  Trial
    0  0      1
    1  1      1
    2  2      1
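With several replicates, the same idea extends naturally: tag each run's frame with its own Trial number, then stack the frames. Below is a minimal sketch, assuming each replicate is already loaded as a data frame. The `replicates` list and its fitness values are made up for illustration; in practice the frames would come from something like pandas.read_csv().

```python
import pandas

# Hypothetical stand-ins for three replicate runs; the values are made up.
replicates = [
    pandas.DataFrame({"Generation": [0, 1], "Max_Fitness": [1.0, 2.0]}),
    pandas.DataFrame({"Generation": [0, 1], "Max_Fitness": [1.5, 2.5]}),
    pandas.DataFrame({"Generation": [0, 1], "Max_Fitness": [0.5, 3.0]}),
]

tagged = []
for trial, run in enumerate(replicates, start=1):
    run = run.copy()      # leave the original frame untouched
    run["Trial"] = trial  # one scalar is broadcast to every row
    tagged.append(run)

# Stack all replicates into a single frame with a fresh 0..n-1 index.
combined = pandas.concat(tagged, ignore_index=True)
print(combined)
```

Copying each frame before tagging it is a deliberate choice here, so the per-run frames stay usable on their own afterwards.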
Adding a row to a data frame:
Sometimes you may want to add a single row to a data frame. This is done using a Python dictionary whose keys correspond to the column names. The following example creates an empty data frame and inserts a single row.
    # Create the dataframe
    df = pandas.DataFrame(columns=("Trial", "Generation", "Max_Fitness", "Avg_Fitness", "Fitness_SEM", "Upp_CI", "Low_CI"))

    # Create a test piece of data
    data = {"Trial": 1, "Generation": 1, "Max_Fitness": 10, "Avg_Fitness": 5, "Fitness_SEM": .25, "Upp_CI": 5.25, "Low_CI": 4.25}

    # Append to the dataframe: wrap the dictionary in a one-row
    # dataframe and concatenate, renumbering the index.
    df = pandas.concat([df, pandas.DataFrame([data])], ignore_index=True)

    # Print the result
    print(df)
The output of this results in:
       Trial  Generation  Max_Fitness  Avg_Fitness  Fitness_SEM  Upp_CI  Low_CI
    0      1           1           10            5         0.25    5.25    4.25
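When rows arrive one at a time, for example one per generation, growing the frame row by row copies the data on every step. A common alternative is to collect the dictionaries in a plain list and build the frame once at the end. Here is a rough sketch of that approach; the `rows` list and the fitness values are hypothetical.

```python
import pandas

columns = ["Trial", "Generation", "Max_Fitness", "Avg_Fitness"]

# Collect one dictionary per row; the values here are made up.
rows = []
for generation in range(3):
    rows.append({
        "Trial": 1,
        "Generation": generation,
        "Max_Fitness": 10 + generation,
        "Avg_Fitness": 5.0 + 0.5 * generation,
    })

# Build the frame in one step; `columns` fixes the column order.
df = pandas.DataFrame(rows, columns=columns)
print(df)
```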
Select a row by column value:
One piece of functionality was difficult to find by searching around, but is essential in my day-to-day processing work: selecting a row or rows by column value. This is done with the following call:
    # Assuming df is a data frame with a Generation column,
    # such as the one built in the previous section.
    selected = df.loc[df["Generation"] == 1]
    print(selected)
This results in the following output:
       Trial  Generation  Max_Fitness  Avg_Fitness  Fitness_SEM  Upp_CI  Low_CI
    0      1           1           10            5         0.25    5.25    4.25
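The same boolean-mask idea extends to several conditions, and DataFrame.query() offers a string-based way to write the filter that may feel closer to R. This is a small sketch on made-up data; the frame and the thresholds are illustrative only.

```python
import pandas

# Made-up data for illustration.
df = pandas.DataFrame({
    "Trial": [1, 1, 2, 2],
    "Generation": [1, 2, 1, 2],
    "Max_Fitness": [10, 12, 9, 15],
})

# Combine conditions with & (and) or | (or); each comparison
# needs its own parentheses.
mask = (df["Generation"] == 2) & (df["Max_Fitness"] > 12)
by_mask = df[mask]

# The same filter written as a query string.
by_query = df.query("Generation == 2 and Max_Fitness > 12")

print(by_mask)
```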
Additional tools such as idxmax(), together with the loc[] and iloc[] indexers given an index value, provide other ways to select individual rows.
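As a quick illustration of those alternatives, the sketch below finds the row with the highest fitness using idxmax() and then pulls rows out by label and by position. The data and the string index labels are made up so that the difference between label and position is visible.

```python
import pandas

# Made-up data with string labels as the index.
df = pandas.DataFrame(
    {"Generation": [1, 2, 3], "Max_Fitness": [10, 15, 12]},
    index=["a", "b", "c"],
)

best_label = df["Max_Fitness"].idxmax()  # index label of the largest value
best_row = df.loc[best_label]            # select a row by index label
first_row = df.iloc[0]                   # select a row by integer position

print(best_label)
print(best_row)
```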
Conclusion:
Switching my data processing from R to Python has been a breeze so far with Pandas. While R is still a very capable language, I find myself struggling with it at times due to my familiarity with more conventional programming languages like C++ and Python. The functionality provided by Pandas has allowed me to do my data preprocessing and statistical analysis in Python while relegating R to creating plots with the ggplot2 package. In the future, this may even be simplified as there are efforts underway to port ggplot2 to Python itself!