Escaping the Spreadsheet Mentality: Start with the Right Data Format

Growing up on a healthy diet of Microsoft Office products, I am well versed in Word, Excel and Powerpoint.  As I have transitioned into the research world, these products still have their place, however, I sometimes find that the habits I developed for organizing data doesn’t necessarily transfer to statistical analysis.  Recently, I ran into a situation where I was evaluating the performance of solutions in multiple different environments.  Organizing this data appeared straightforward to me at first, I would simply group the different environments into one row grouped by the id of the individual.  My data then looked something like this:

GenerationEnvironment 1Environment 2Environment 3
110.312.18.2
214.110.27.4
38.613.410.2
49.811.29.3

At the time, it seemed completely logical to organize it this way since I could add in a variable number of environments.  It’s simple to understand from a human perspective.  The difficulties cropped up when I went to use my favorite plotting package, ggplot2, and R.  Importing the data was pretty straightforward still using the read.csv() command.  However, plotting the results did not go at all as planned due to my attempts to integrate variable numbers of environments.

data_plot     labs(title="Fitness over Generations", x="Generation", y="Fitness") +
    xlim(min=0,max=400) +
    ylim(min=0,max=20) +
    theme(legend.position="none")

# Add lines to the plot.
for(i in 2:ncol(all_data)) {
    data_plot }

This approach seemed straightforward, the for loop simply adds a geom_line() to the data_plot at each iteration.  However, when actually viewing the plot, I was left with only the final line added, rather than three in this case.  I was stumped as to the behavior I was seeing from the code, and even consulted Brian Connelly, my goto R guru.  He couldn’t make sense of what was happening either.  After much head scratching, I did some searching and came across this stackoveflow post.  Apparently, ggplot does not work with data this way and this code results in the single geom_line() being added to the plot.  Exactly why this occurs I haven’t been able to figure out yet.

With this new information in hand, I proceeded to reorganize my data into a three column format as follows:

GenerationFitnessEnvironment
110.3Environment_1
214.1Environment_1
38.6Environment_1
49.8Environment_1
112.1Environment_2
210.2Environment_2
313.4Environment_2
411.2Environment_2
18.2Environment_3
27.4Environment_3
310.2Environment_3
49.3Environment_3

Now, I went ahead with my normal R/ggplot2 process of loading the data and plotting it using the following code:

data_plot                     group=Environment, color=Environment)) +
    labs(title="Fitness over Generations", x="Generation", y="Fitness") +
    xlim(min=0,max=400) +
    ylim(min=0,max=20) +
    theme(legend.position="none") +
    geom_line() +
    scale_colour_manual(values = cbPalette)

Of course, this is the conventional way of handling data in R, letting ggplot handle the grouping by columns, but I had gotten ahead of myself organizing the data originally.  In hindsight, thinking about data organization the way my statistics software handles it would have saved me an hour or so of frustration and head scratching.  Sometimes, it’s just hard to break that spreadsheet mentality!  Do yourself a favor and organize your data for easy analysis, even though it may not be intuitive to the human eye.

Leave a Reply

Your email address will not be published. Required fields are marked *

Are you a spammer? *