Tag Archives: Data Analysis

Color Points by Factor with Bokeh

Bokeh (https://bokeh.pydata.org/en/latest/) has been on my radar for some time as I move my data processing primarily to Jupyter notebooks.  The plots have sensible defaults and generally look good out of the box.  Compared to matplotlib, I find that I need much less customization to get to my final product.

Unfortunately, sometimes the process of generating a plot isn’t a one-to-one mapping with my prior experiences.  One such area of difficulty recently was generating a plot with four treatments, coloring each group of circles independently.  After much trial and error, the following code generated a rough plot I was happy with.

from bokeh.io import output_notebook
from bokeh.palettes import brewer
from bokeh.plotting import figure, show
import pandas

# Assumes df => data frame with columns: X_Data, Y_Data, Factor

# Create colors for each treatment 
# Rough Source: http://bokeh.pydata.org/en/latest/docs/gallery/brewer.html#gallery-brewer
# Fine Tune Source: http://bokeh.pydata.org/en/latest/docs/gallery/iris.html

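# Get a palette with one color per unique factor value.
# Brewer palettes are keyed by size; Spectral covers 3 to 11 colors,
# so this lookup assumes the Factor column has 3 to 11 unique values.
colors = brewer["Spectral"][len(df.Factor.unique())]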

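# Create a map between factor and color.
# Index the palette by position, not by the factor value itself,
# so this also works when the factors are strings or arbitrary numbers.
colormap = {factor: colors[i] for i, factor in enumerate(df.Factor.unique())}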

# Create a list of colors for each value that we will be looking at.
colors = [colormap[x] for x in df.Factor]

# Generate the figure.
output_notebook()
p = figure(plot_width=800, plot_height=400)

# add a circle renderer with a size and a per-point color
p.circle(df['X_Data'], df['Y_Data'], size=5, color=colors)

# show the results
show(p)

The general process is to first get a color palette from bokeh.palettes.brewer, with the number of colors set by how many unique values exist in the Factor column.  Next, create a map from each unique value in that column to a color.  Finally, build a list that assigns a color to every data point, and pass that list to the circle call when plotting.
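The mapping step is the part that tripped me up, so here is a minimal sketch of it on its own, using plain Python with no Bokeh or pandas.  The function name colors_by_factor, the treatment labels, and the four hex colors are all made up for illustration; in the real plot the palette would come from bokeh.palettes.brewer.

```python
def colors_by_factor(factors, palette):
    """Return (one color per data point, factor-to-color map)."""
    # Unique factor values in order of first appearance
    # (dicts preserve insertion order in Python 3.7+).
    levels = list(dict.fromkeys(factors))
    if len(levels) > len(palette):
        raise ValueError("palette has fewer colors than factor levels")
    # Assign colors by position in the list of levels, not by factor value.
    colormap = {level: palette[i] for i, level in enumerate(levels)}
    return [colormap[f] for f in factors], colormap


# Hypothetical stand-in palette and treatment labels, for illustration only.
palette = ["#2b83ba", "#abdda4", "#fdae61", "#d7191c"]
treatments = ["control", "low", "high", "control", "very_high", "low"]

point_colors, colormap = colors_by_factor(treatments, palette)
print(colormap)
print(point_colors)
```

The resulting point_colors list has one entry per data point, which is the shape the color argument of the circle call expects.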

You should get something similar to the following figure, depending on the data you import.  Enjoy!

Add color to your plots by factor!


(Bokeh 0.12.7)

Getting IPython Notebook to Run “Correctly” in Mac OS X 10.8

I’m going to keep this post brief so that the steps are clear and concise.  I wanted to get IPython Notebook, a powerful tool for data analysis, running with plotting and pandas in Mac OS X 10.8.  When I first tried, I ran into errors caused by conflicts between 32-bit and 64-bit installations of different packages.  After a good deal of trial and error, I found the following steps resulted in a full IPython Notebook environment with pandas and Matplotlib functioning flawlessly.


R Quick Tip: Use %in% to filter a data frame.

Working with R, I was looking for an easy way to subset my data based on a sequence of numbers.  After initially writing a for loop and using rbind (terrible to do in R!), I finally found a way to do this efficiently.  The %in% operator tests whether each value on its left appears in the vector on its right, so you can use it as a filter in the subset command to keep only the rows that match your sequence.  Enjoy!

# Generate sample data to test with.
sample_data <- data.frame(ID=seq(1,100,1),
                          Score=sample(0:100,100,replace=TRUE))
summary(sample_data)

# Plot the scores, see that there is a score for each id.
plot(sample_data$Score~sample_data$ID)

# Create a filter to apply.
look_at <- seq(1,100,10)

# Filter the sample data by look_at using the %in% operator.
subset_data <- subset(sample_data, ID %in% look_at)

# Plot the scores, note the filtered data.
plot(subset_data$Score~subset_data$ID)