Starting Quirks with Pandas from an R Junkie

Okay, okay, the title might be a little sensationalised.  I have been using the R statistics package to process the results of evolutionary runs since beginning my PhD two years ago.  In that time, I have become familiar with the basic process of importing data, computing basic population statistics (mean, confidence intervals, etc.), and plotting with ggplot.  I’ve always felt that I could streamline the process, though, as I already do a great deal of preprocessing in Python.  This typically involves combining multiple replicate runs into one data file and possibly doing some basic statistics using Python's built-in functionality.

Even as a dedicated Python user, I still appreciate the data frames that R provides and the ability to work with data through a query-style interface.  Although some of R's behaviour seems strange to me as a programmer, I put up with it to work with my data.  Recently, I came across Pandas, the Python Data Analysis Library.  The library offers functionality similar to R-style data frames, and with that I was hooked.  Of course, switching over was not such an easy task, and I spent my fair share of time searching through StackOverflow posts.  In this article, I’ll present some of the features that I found immediately useful when moving from R to Python/Pandas for my data processing.

Getting Started with Pandas:

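Getting a quick start with pandas is extremely easy.  The following code snippet imports the library, creates an empty data frame, and creates a second data frame with a single column populated with three values.  Combining the two is a little strange, as the operation returns a new data frame that you have to assign to a variable rather than modifying the original in place.  Note that DataFrame.append(), which older tutorials use for this, was removed in pandas 2.0; pandas.concat() is its replacement.

import pandas

# Create an empty dataframe
df = pandas.DataFrame()

# Create a second dataframe with some simple data.
data = pandas.DataFrame({"A": range(3)})

# Append data to the empty dataframe. DataFrame.append() was removed in
# pandas 2.0, so use pandas.concat(), which returns a new dataframe.
df = pandas.concat([df, data])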

# Print out the result.
print(df)
   A
0  0
1  1
2  2

Adding a Column to a dataframe:

A common operation when processing data is adding a column that holds the same value for every row in a data frame, such as a Trial number.  This is done with the following command:

# Append a Trial column to the dataframe for multiple replicates.
df['Trial'] = 1

print(df)

   A  Trial
0  0      1
1  1      1
2  2      1
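
The Trial column is most useful when combining several replicate runs into one data frame. The following is a minimal sketch of that workflow, not the exact preprocessing script: the replicate data is made up for illustration, and in practice each frame would be read from a results file.

```python
import pandas

# Hypothetical per-replicate results; real data would be loaded from files.
replicates = {
    1: pandas.DataFrame({"Generation": [0, 1], "Max_Fitness": [3.0, 4.5]}),
    2: pandas.DataFrame({"Generation": [0, 1], "Max_Fitness": [2.5, 5.0]}),
}

frames = []
for trial, frame in replicates.items():
    # assign() returns a copy with the extra column, leaving the input untouched.
    frames.append(frame.assign(Trial=trial))

# Stack all replicates into one data frame with a fresh 0..n-1 index.
combined = pandas.concat(frames, ignore_index=True)
print(combined)
```

Each row of the combined frame now records which replicate it came from, so later steps can filter or group on Trial.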

Adding a row to a Data Frame:

Sometimes you may want to add a single row to a Data Frame.  The row is described by a Python dictionary whose keys are the column names.  The following example creates an empty data frame and inserts a single row.

# Create the dataframe
df = pandas.DataFrame(columns=("Trial","Generation","Max_Fitness","Avg_Fitness","Fitness_SEM","Upp_CI","Low_CI"))

# Create a test piece of data
data = {"Trial":1,"Generation":1,"Max_Fitness":10,"Avg_Fitness":5,"Fitness_SEM":.25,"Upp_CI":5.25,"Low_CI":4.25}

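# Append to the dataframe. DataFrame.append() was removed in pandas 2.0,
# so wrap the dictionary in a one-row dataframe and use pandas.concat().
df = pandas.concat([df, pandas.DataFrame([data])], ignore_index=True)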

# Print the result
print(df)

This produces the following output:

  Trial Generation Max_Fitness Avg_Fitness Fitness_SEM Upp_CI Low_CI
0 1     1          10          5           0.25        5.25   4.25
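
When many rows are added in a loop, extending the data frame one row at a time copies the existing data on every step. A common alternative is to collect the rows in a plain Python list and build the data frame once at the end. The sketch below assumes made-up placeholder values and a shortened column list purely for illustration.

```python
import pandas

columns = ["Trial", "Generation", "Max_Fitness", "Avg_Fitness"]

rows = []
for generation in range(1, 4):
    # Placeholder values standing in for real per-generation statistics.
    rows.append({
        "Trial": 1,
        "Generation": generation,
        "Max_Fitness": 10 * generation,
        "Avg_Fitness": 5 * generation,
    })

# Build the data frame in one step once every row has been collected.
summary = pandas.DataFrame(rows, columns=columns)
print(summary)
```

Passing columns explicitly fixes the column order regardless of the key order in each dictionary.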

Select a row by column value:

One specific piece of functionality was difficult to find when searching around, but is essential to my day-to-day processing work: selecting a row or rows by column value.  This is done with the following call:

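# Using the data frame from the previous example, which has a Generation column.
# loc[] with a boolean mask returns the matching rows (.ix was removed in pandas 1.0).
selected = df.loc[df["Generation"] == 1]
print(selected)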

This results in the following output:

  Trial Generation Max_Fitness Avg_Fitness Fitness_SEM Upp_CI Low_CI
0 1     1          10          5           0.25        5.25   4.25

Other tools, such as idxmax() and the loc[] and iloc[] indexers used with an index value, provide further ways to select individual rows.  (The older ix indexer was removed in pandas 1.0.)
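
The selection techniques mentioned in this section can be sketched together as follows. The data frame and its values are invented for illustration; only the column names follow the earlier examples.

```python
import pandas

df = pandas.DataFrame({
    "Trial": [1, 1, 2, 2],
    "Generation": [1, 2, 1, 2],
    "Max_Fitness": [10.0, 12.0, 9.0, 15.0],
})

# Boolean mask: every row from generation 2.
gen_two = df.loc[df["Generation"] == 2]

# Combine conditions with & and wrap each one in parentheses.
trial_one_gen_two = df.loc[(df["Trial"] == 1) & (df["Generation"] == 2)]

# idxmax() returns the index label of the largest value; loc[] fetches that row.
best_row = df.loc[df["Max_Fitness"].idxmax()]

# iloc[] selects by integer position rather than by label.
first_row = df.iloc[0]

print(gen_two)
print(best_row)
```

Note that a boolean mask returns a data frame, even when only one row matches, while loc[] or iloc[] with a single index value returns that row as a Series.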

Conclusion:

Switching my data processing from R to Python has been a breeze so far with Pandas.  While R is still a very capable language, I find myself struggling with it at times due to my familiarity with more conventional programming languages like C++ and Python.  The functionality provided by Pandas has allowed me to do my data preprocessing and statistical analysis in Python while relegating R to creating plots with the ggplot2 package.  In the future, this may even be simplified as there are efforts underway to port ggplot2 to Python itself!
