Tag Archives: R

R Quick Tip: Use %in% to filter a data frame.

Working with R, I was looking for a way to easily subset my data based on a sequence of numbers.  After initially doing it with a for loop and rbind (terrible to do in R!), I finally found a way to do this efficiently.  Using the %in% operator as the filter in the subset command, you can pull out just the rows that match your sequence.  Enjoy!

# Generate sample data to test with.
sample_data <- data.frame(ID=seq(1,100,1),
                          Score=sample(0:100,100,rep=TRUE))
summary(sample_data)

# Plot the scores, see that there is a score for each id.
plot(sample_data$Score~sample_data$ID)

# Create a filter to apply.
look_at <- seq(1,100,10)

# Filter the sample data by look_at using the %in% command.
subset_data <- subset(sample_data, ID %in% look_at)

# Plot the scores, note the filtered data.
plot(subset_data$Score~subset_data$ID)
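
As a quick aside, subset() isn't the only place %in% fits; the same filter works with plain bracket indexing.  Here is a minimal equivalent of the subset() call above:

# Equivalent filter using bracket indexing instead of subset().
subset_data2 <- sample_data[sample_data$ID %in% look_at, ]

# Both approaches return the same rows (should print TRUE).
identical(subset_data, subset_data2)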

Starting Quirks with Pandas from an R Junkie

Okay, okay, the title might be a little sensationalised.  I have been using the R statistics package for processing the results of evolutionary runs since beginning my PhD 2 years ago.  In that time, I have become familiar with the basic workflow: importing data, computing basic population statistics (mean, confidence intervals, etc.), and plotting with ggplot.  I've always felt that I could streamline the process, though, as I do a great deal of preprocessing in Python.  This typically involves combining multiple replicate runs into one data file and sometimes doing some basic statistics using Python's built-in functionality.


Escaping the Spreadsheet Mentality: Start with the Right Data Format

Growing up on a healthy diet of Microsoft Office products, I am well versed in Word, Excel and Powerpoint.  As I have transitioned into the research world, these products still have their place; however, I sometimes find that the habits I developed for organizing data don't necessarily transfer to statistical analysis.  Recently, I ran into a situation where I was evaluating the performance of solutions in multiple different environments.  Organizing this data appeared straightforward to me at first: I would simply collect the different environments into one row, grouped by the id of the individual.  My data then looked something like this:

Generation   Environment 1   Environment 2   Environment 3
1            10.3            12.1            8.2
2            14.1            10.2            7.4
3            8.6             13.4            10.2
4            9.8             11.2            9.3
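
For reference, here is a minimal sketch of that wide layout written as an R data frame (the column names are placeholders I've chosen to mirror the table above):

# Wide format: one row per generation, one column per environment,
# with the values taken from the table above.
wide_data <- data.frame(Generation=c(1, 2, 3, 4),
                        Environment1=c(10.3, 14.1, 8.6, 9.8),
                        Environment2=c(12.1, 10.2, 13.4, 11.2),
                        Environment3=c(8.2, 7.4, 10.2, 9.3))
summary(wide_data)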
