November 23, 2018

EARL conference recap: Seattle 2018

I had the pleasure of attending the EARL (Enterprise Applications of the R Language) Conference held in Seattle on 2018-11-07, and the honour of being one of the speakers. The EARL conferences occupy a unique niche in the R conference universe, bringing together the I-use-it-at-work contingent of the R community. The Seattle event was, from my perspective (I use R at work, and lead a team of data scientists that uses R) a fantastic conference. Full marks to the folks from Mango Solutions for organizing it!

The conference started with a keynote, “Text Mining with Tidy Data Principles”, from the always-brilliant Julia Silge. She’s an undisputed leader in the field of text analysis with R (the book she co-authored with David Robinson, Text Mining with R: A Tidy Approach, is already a cornerstone resource), and although I’d heard her deliver some of the same material at the Joint Statistical Meetings in July, this talk
  1. was longer and
  2. introduced some of her thinking about problems she’s tackling at her job at Stack Overflow.
It was fascinating to see where the utility of R as a text analysis tool is going, and Julia’s engaging manner, energy, and enthusiasm were a great start to the day.

Next up was a panel of leaders in the R community, on “Examining the future of R in industry”. The panelists were:
  • the aforementioned Julia Silge,
  • David Smith from Microsoft (he has the title “Cloud Developer Advocate–AI & Data Science”, but he’s also famous in the R community for his editorship and contributions to the Revolutions blog), and
  • Joe Cheng (the creator of Shiny and the CTO and Shiny team lead at RStudio).
With a trio of this calibre it was no surprise that they had a wide-ranging and thoughtful discussion of the questions from the floor, covering everything from the pros and cons of different open source licenses to deploying R in production environments. The panel seemed, in my opinion, to land on a consensus: the future of R is bright, it will remain specialized as a data science tool, and we will continue to see it integrated with other tools.

The rest of the day was dedicated to the presentations, which covered a wide range of topics, from modeling roadway speeds (Joonbum Lee of Battelle Memorial Institute) and quantitative risk assessment at Starbucks (David Severski) to using deep learning on satellite images (Damian Rodziewicz of Appsilon Data Science). All of the speakers were engaging, had a great perspective on their topics, and only one (full disclosure: me) nattered on and didn’t leave any time for questions from the floor.

Intending no slight to the other speakers, three presentations really struck a chord with me.
Eina Ooka from The Energy Authority spoke about her experience moving to R (and all of the benefits, from reproducibility to accuracy) in what she termed an “Excel-pervasive” environment. The space I work in is much the same; Excel is a workhorse for a lot of numeric analysis, and it is a go-to tool for many of the people in the client organizations we serve. Some of those clients expect delivery of their data tables in an Excel file. Eina’s success tackling the transition, in spite of the hurdles she faced, was inspiring.

Stephanie Kirmer from Uptake delivered what was, to me, perhaps the most immediately relevant talk: “The case for R packages as team collaboration tools”. I particularly liked the matrix showing the “Progression of Team Collaboration Infrastructure”, with version control, code sharing, and code storage and dissemination at four levels of sophistication. I was struck by how far my colleagues and I have to go to move up the ladder, but immediately recognized at least one project where a package would be an ideal way for us to start to collaborate more effectively.

And finally there was Aimee Gott from Mango Solutions, whose closing talk “Building a data science team with R” was the perfect summary of everything that had preceded it. Again, it was a typology that stuck with me–in this case, types of R users, from the Super Users to the Cut & Paste Tweakers.
In short, the conference was a great way to hear from and meet R users who are finding applications for it in a business (or in my case, government) setting. Thanks again to Mango Solutions.

The 2018 EARL road show continued on to Houston (2018-11-09) and Boston (2018-11-13), each with different slates of speakers.

My only hope is that next year’s EARL road show makes a stop in Canada!

Note: looking for the slides and full narrative of my talk?
Bonus note: this post can be found in the B.C. Government GitHub repo dedicated to public presentations on the topic of R.


-30-

August 31, 2018

Smoke from a distant fire

Forest fires and air quality

It was recently announced that 2018 is British Columbia’s most extensive forest fire season on record. As I write this (2018-08-31) there are 442 wildfires burning in British Columbia. These fires have a significant impact on people’s lives–many areas are under evacuation order or evacuation alert, and there are reports that homes have been destroyed by the blazes.

The fires also create a significant amount of smoke, which has been pushed great distances by the shifting winds, reaching the large population centres of Vancouver and Victoria in British Columbia, as well as the Seattle metropolitan region and elsewhere in Washington. (Clifford Mass, Professor of Atmospheric Sciences at the University of Washington in Seattle, has written extensively about the smoke events in the region; see for example Western Washington Smoke: Darkest Before the Dawn from 2018-08-22.)

The Province of British Columbia has many air quality monitoring stations around the province, and makes the data available. The measure most used for monitoring the effects on health is PM2.5 (sometimes written PM25): fine particulate matter with a diameter of 2.5 microns (millionths of a metre) or less. The B.C. government has a Current Particulate Matter map that colour codes the one hour average measures for all the testing stations around the province.

The data file and a simple plot


The DataBC Catalogue provides access to air quality data. There’s “verified” data to the end of 2017, and “unverified” data for the past 30 days. Since we want to see what happened this month, it’s the latter we want. (The page with the links to the raw files is here.)

The files are arranged by particulate or gas type; there’s a table for ozone and another for sulphur dioxide, and others for the particulate matter. Note that the data are made available under the Province of B.C.’s Open Data license, and are in nice tidy form. And the date format is ISO 8601, which makes me happy.

To make sure we’ve got a reproducible version, I’ve saved the file I downloaded early this morning to my Google Drive. The link to the folder is here.

For the first plot, let’s look at the PM2.5 level for my hometown of Victoria, B.C. The code below loads the R packages we'll use, reads the data, and generates the plot.


# tidyverse packages
library(tidyverse)
library(glue)

PM25_data <- readr::read_csv("PM25_2018-08-31.csv")

PM25_data %>%
  filter(STATION_NAME == "Victoria Topaz") %>%
  ggplot() +
  geom_line(aes(x = DATE_PST, y = REPORTED_VALUE)) +
  labs(x = "date",
       title = glue("Air quality: Victoria Topaz"),
       subtitle = "one hour average, µg/m3 of PM2.5",
       caption = "data: B.C. Ministry of Environment and Climate Change Strategy")



There are 61 air quality monitoring stations around British Columbia. It would be interesting to see how the air quality was in other parts of the region. Since over half (54% in 2017) of the province’s population lives in the Vancouver Census Metropolitan Area (CMA), let’s plot the air quality there. There are multiple stations in the Vancouver CMA, so I chose the one at Burnaby South; it’s fairly central in the region. We can run this line of code to see a listing of all 61 stations (but we won’t do that now…)

# list all the air quality stations for which there is PM2.5 data
unique(PM25_data$STATION_NAME)

And since we’re going to be doing this often, let’s wrap the code that filters for the location we want and draws the plot in a function. Note that we’ll use a variable, station_name, so all we need to do to change the plot is assign the name of the station we want, and off we go. Not only does this simplify our lives now, it is all but essential for a Shiny application (there’s a sketch of that after the Burnaby example below).

# the air quality plot
PM25_plot <- function(datafile, station_name){
  datafile %>%
  filter(STATION_NAME == station_name) %>%
  ggplot() +
  geom_line(aes(x = DATE_PST, y = REPORTED_VALUE)) +
  labs(x = "date",
       title = glue("Air quality: ", station_name),
       subtitle = "one hour average, µg/m3 of PM2.5", 
       caption = "data: B.C. Ministry of Environment and Climate Change Strategy")
}

Now that we've got the function, the code to create the plot for the Burnaby South station is significantly simplified: assign the station name, and call the function.

# our Burnaby plot
station_name <- "Burnaby South"

PM25_plot(PM25_data, station_name)
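
Because the function only needs a data frame and a station name, it drops straight into Shiny. Here is a minimal sketch of such an app (an illustration only, not an application from this post; it assumes PM25_data and PM25_plot() are defined as above):

# a minimal Shiny app built around PM25_plot() (sketch)
library(shiny)

ui <- fluidPage(
  selectInput("station", "Monitoring station:",
              choices = sort(unique(PM25_data$STATION_NAME))),
  plotOutput("pm25_plot")
)

server <- function(input, output) {
  # re-draw the plot whenever the selected station changes
  output$pm25_plot <- renderPlot({
    PM25_plot(PM25_data, input$station)
  })
}

shinyApp(ui, server)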




And what about the towns that are the closest to the fires? While there are fires burning across the province, the fires that are burning the forests of the Nechako Plateau have understandably received a lot of attention. You may have seen the news stories and images from Prince George like this and this, or the images of the smoke plume from the NASA Worldview site.

Prince George is east of many of the major fires and, given the prevailing westerly winds, downwind of them. So what has the air quality in Prince George been like?

station_name <- "Prince George Plaza 400"
PM25_plot(PM25_data, station_name)




Or still closer to the fires, the town of Burns Lake.

station_name <- "Burns Lake Fire Centre"
PM25_plot(PM25_data, station_name)



The town of Smithers is west of the fires that are burning on the Nechako Plateau and producing all the smoke experienced in Burns Lake and Prince George. The residents of Smithers have had a very different experience, only seeing smoke in the sky when the winds shifted to become easterly.

station_name <- "Smithers St Josephs" 
PM25_plot(PM25_data, station_name)




Multiple stations in one plot

You may have noticed that the Y axis scale on the plots can be quite different–for example, Victoria reaches 300, Smithers gets to 400, Prince George is double that at 800, and Burns Lake is more than double again. There are two ways we can compare multiple stations directly: a single plot, or faceted plots.

A line plot with four stations

station_name <- c("Burns Lake Fire Centre", "Prince George Plaza 400", 
                  "Smithers St Josephs", "Victoria Topaz")

PM25_data %>%
  filter(STATION_NAME %in% station_name) %>%
  ggplot() +
  geom_line(aes(x = DATE_PST, y = REPORTED_VALUE, colour = STATION_NAME)) +
  labs(x = "date",
       title = glue("Air quality: Burns Lake, Prince George, Smithers, Victoria"),
       subtitle = "one hour average, µg/m3 of PM2.5", 
       caption = "data: Ministry of Environment and Climate Change Strategy")




With four complex lines as we have here, it can be hard to discern which line is which. 

Use facets to plot the four stations separately


Facets give us another way to view the comparisons. In the first version, with the facets stacked vertically, the layout emphasizes comparisons on the X axis–that is, over time. In this way, we can see that the four locations have had smoke events at different times.

station_name <- c("Burns Lake Fire Centre", "Prince George Plaza 400", 
                  "Smithers St Josephs", "Victoria Topaz")


PM25_data %>%
  filter(STATION_NAME %in% station_name) %>%
  ggplot() +
  geom_line(aes(x = DATE_PST, y = REPORTED_VALUE)) +
  facet_grid(STATION_NAME ~ .) +
  labs(title = glue("Air quality: Burns Lake, Prince George, Smithers, Victoria"),
       subtitle = "one hour average, µg/m3 of PM2.5", 
       caption = "data: B.C. Ministry of Environment and Climate Change Strategy") +
  theme(axis.text.x=element_text(size=rel(0.75), angle=90),
        axis.title = element_blank())



In the second version, the facets are placed horizontally, making comparisons on the Y axis clear. The smoke events in the four locations have been of very different magnitudes.

PM25_data %>%
  filter(STATION_NAME %in% station_name) %>%
  ggplot() +
  geom_line(aes(x = DATE_PST, y = REPORTED_VALUE)) +
  facet_grid(. ~ STATION_NAME) +
  labs(title = glue("Air quality: Burns Lake, Prince George, Smithers, Victoria"),
       subtitle = "one hour average, µg/m3 of PM2.5", 
       caption = "data: B.C. Ministry of Environment and Climate Change Strategy") +
  theme(axis.text.x=element_text(size=rel(0.75), angle=90),
        axis.title = element_blank())




These two plots show not only that Burns Lake and Prince George have had the most extreme smoke events, but also that they have had sustained periods of poor air quality through the whole month. While the most extreme event in Smithers exceeds that of Victoria, there hasn’t been a prolonged period of smoke in the air as there has been at the other three locations.
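
To put some rough numbers behind that claim, here is a quick summary per station (a sketch; the 25 µg/m3 cut-off is just an illustrative threshold, not a standard taken from the data source):

# for each of the four stations: the maximum one-hour reading, and the
# share of hours at or above an illustrative 25 µg/m3 threshold
PM25_data %>%
  filter(STATION_NAME %in% station_name) %>%
  group_by(STATION_NAME) %>%
  summarise(max_pm25 = max(REPORTED_VALUE, na.rm = TRUE),
            share_hours_over_25 = mean(REPORTED_VALUE >= 25, na.rm = TRUE))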

-30-






April 3, 2017

Storytelling with Data: consumer alert

In March 2016 I gave a favourable review to a newly published book, Storytelling with Data by Cole Knaflic. Today (2017-04-03) she posted a new blog entry, "the book you're holding might be a fake!", in response to the discovery that poor-quality pirate editions of her books are available for purchase.

Knaflic's response has been exemplary: notifying people who bought the book what to look for in the knock-offs, and how to exchange the book for a proper copy. If you have a copy, go to the linked site above and check your copy against the description. (Fortunately, my copy is legit.)

This problem has alerted me to something I neglected to mention in my original review: the high quality of the physical book. The paper has a soft sheen that allows the print and graphics to stand out, the colours are sharp and consistent, and the printing is clear.

A year later, my high opinion of this book has not shifted.

-30-

March 26, 2017

Updated Shiny app

A short post to alert the world that my modest Shiny application, showing Major League Baseball run scoring trends since 1901, has been updated to include the 2016 season. The application can be found here:
https://monkmanmh.shinyapps.io/MLBrunscoring_shiny/.

In addition to updating the underlying data, the update moved some of the processing that was happening inside the application into the pre-processing stage. That processing needs to happen only once, and is not related to the reactivity of the application. This will improve the speed of the application: as well as reducing the processing, it shrinks the size of the data table loaded into the application.
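
The general pattern looks something like the sketch below (hypothetical object and file names, not the app’s actual code): do the one-time aggregation in a pre-processing script, save the small result, and have the app load only that.

# pre-processing script (sketch): aggregate once, outside the app
library(dplyr)
library(readr)

teams_raw <- read_csv("Teams.csv")        # large raw table

runscoring <- teams_raw %>%
  group_by(yearID, lgID) %>%
  summarise(RPG = sum(R) / sum(G))        # small summary table the app needs

write_csv(runscoring, "runscoring.csv")   # the Shiny app reads only this file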

The third set of changes was a consequence of the updates to the Shiny and ggplot2 packages in the two years that have passed since I built the app. In Shiny, the “format” argument of the sliderInput widget has been deprecated. And in ggplot2, there was a change in the quoting of the “method” specification in stat_smooth(). Little things that took only a few minutes to debug! Next up will be some formatting changes, and a different approach to one of the visualizations.
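
For anyone hitting the same sliderInput deprecation, an illustrative snippet (not the app’s actual code): number formatting that used to go through the format argument is now handled with the sep, pre, and post arguments.

library(shiny)

# a year-range slider; sep = "" stops years displaying with a thousands
# separator (1901 rather than 1,901)
sliderInput("yearrange", "Seasons:",
            min = 1901, max = 2016, value = c(1901, 2016),
            sep = "")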

 -30-

November 15, 2016

Subtitles and captions with ggplot2 v.2.2.0

Back in March 2016, I wrote about an extension to the R package ggplot2 that allowed subtitles to be added to charts. The process took a bit of fiddling and futzing, but now, with the release of ggplot2 version 2.2.0, it’s easy.

Let’s retrace the steps, and create a chart with a subtitle and a caption, the other nifty feature that has been added.

First, let’s read the packages we’ll be using, ggplot2 and the data carpentry package dplyr:

# package load 
library(ggplot2)
library(dplyr)

Read and summarize the data

For this example, we’ll use the baseball data package Lahman (bundling the Lahman database for R users), and the data table ‘Teams’ in it.

Once it’s loaded, the data are filtered and summarized using dplyr.
  • filter from 1901 [the establishment of the American League] to the most recent year,
  • filter out the Federal League,
  • summarise the total number of runs scored, runs allowed, and games played, and
  • using `mutate`, calculate the league runs per game (leagueRPG) and runs allowed per game (leagueRAPG).

library(Lahman)
data(Teams)

MLB_RPG <- Teams %>%
  filter(yearID > 1900, lgID != "FL") %>%
  group_by(yearID) %>%
  summarise(R=sum(R), RA=sum(RA), G=sum(G)) %>%
  mutate(leagueRPG=R/G, leagueRAPG=RA/G)

A basic plot

You may have heard that run scoring in Major League Baseball has been down in recent years…but what better way to see if that’s true than by plotting the data?

For the first version of the plot, we’ll make a basic X-Y plot, where the X axis has the years and the Y axis has the average number of runs scored. With ggplot2, it’s easy to add a trend line (the geom_smooth option).

The scale_x_continuous and scale_y_continuous options set the breaks and limits of the axes.

MLBRPGplot <- ggplot(MLB_RPG, aes(x=yearID, y=leagueRPG)) +
  geom_point() +
  geom_smooth(span = 0.25) +
  scale_x_continuous(breaks = seq(1900, 2015, by = 20)) +
  scale_y_continuous(limits = c(3, 6), breaks = seq(3, 6, by = 1))

MLBRPGplot




So now we have a nice looking dot plot showing the average number of runs scored per game for the years 1901-2015. (The data for the 2016 season, recently concluded, has not yet been added to the Lahman database.)

With the basic plot object now created, we can make changes to the format. In the past, the way we would set the title, along with the X and Y axis labels, would be something like this:

MLBRPGplot +
  ggtitle("MLB run scoring, 1901-2014") +
  theme(plot.title = element_text(hjust=0, size=16)) +
  xlab("year") +
  ylab("team runs per game")


Adding a subtitle and a caption: the function

A popular feature of charts–particularly in magazines–is a subtitle that has a summary of what the chart shows and/or what the author wants to emphasize.

In this case, we could legitimately say something like any of the following:
  • The peak of run scoring in the 2000 season has been followed by a steady drop
  • Teams scored 20% fewer runs in 2015 than in 2000
  • Team run scoring has fallen to just over 4 runs per game from the 2000 peak of 5 runs
  • Run scoring has been falling for 15 years, reversing a 30 year upward trend
I like this last one, drawing attention not only to the recent decline but also the longer trend that started with the low-scoring environment of 1968.

How can we add a subtitle to our chart that does that, as well as a caption that acknowledges the source of the data? The labs function in ggplot2 version 2.2.0 now supports subtitle and caption arguments, which let us do exactly that.

Note that labs now takes the title, subtitle, and caption, as well as the X and Y axis labels.

MLBRPGplot +
  labs(title = "MLB run scoring, 1901-2015",
       subtitle = "Run scoring has been falling for 15 years, reversing a 30 year upward trend",
       caption = "Source: the Lahman baseball database", 
       x = "year", y = "team runs per game") 




Easy.
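
If you want to adjust how the new elements look, ggplot2 2.2.0 also gives the subtitle and caption their own theme elements. A quick sketch (the sizes and colour here are just illustrative choices):

MLBRPGplot +
  labs(title = "MLB run scoring, 1901-2015",
       subtitle = "Run scoring has been falling for 15 years, reversing a 30 year upward trend",
       caption = "Source: the Lahman baseball database",
       x = "year", y = "team runs per game") +
  # style the subtitle and caption with their new theme elements
  theme(plot.subtitle = element_text(size = 10, colour = "grey40"),
        plot.caption = element_text(size = 8, hjust = 1))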

Thanks to everyone involved with ggplot2 who made this possible.

The code for this post (as an R markdown file) can be found in my Bayesball GitHub repo.


-30-

March 14, 2016

Adding a subtitle to ggplot2

A couple of days ago (2016-03-12) a short blog post by Bob Rudis appeared on R-bloggers.com, "Subtitles in ggplot2". I was intrigued by the idea and what this could mean for my own plotting efforts, and it turned out to be very simple to apply. (Note that Bob's post originally appeared on his own blog, as "Subtitles in ggplot2".)

In order to see if I could create a plot with a subtitle, I went back to some of my own code drawing on the Lahman database package. The code below summarizes the data using dplyr, and creates a ggplot2 plot showing the annual average number of runs scored by each team in every season from 1901 through 2014, including a trend line using the loess smoothing method.

This is an update to my series of blog posts, most recently 2015-01-06, visualizing run scoring trends in Major League Baseball.

# load the package into R, and open the data table 'Teams' into the
# workspace
library(Lahman)
data(Teams)
#
# package load 
library(dplyr)
library(ggplot2)
#
# CREATE SUMMARY TABLE
# ====================
# create a new dataframe that
# - filters from 1901 [the establishment of the American League] to the most recent year,
# - filters out the Federal League
# - summarizes the total number of runs scored, runs allowed, and games played
# - calculates the league runs and runs allowed per game 

MLB_RPG <- Teams %>%
  filter(yearID > 1900, lgID != "FL") %>%
  group_by(yearID) %>%
  summarise(R=sum(R), RA=sum(RA), G=sum(G)) %>%
  mutate(leagueRPG=R/G, leagueRAPG=RA/G)

Plot the MLB runs per game trend

Below is the code to create the plot, including the formatting. Note the hjust=0 (for horizontal justification = left) in the plot.title line. This is because the default for the title is to be centred, while the subtitle is to be justified to the left.

MLBRPGplot <- ggplot(MLB_RPG, aes(x=yearID, y=leagueRPG)) +
  geom_point() +
  theme_bw() +
  theme(panel.grid.minor = element_line(colour="gray95")) +
  scale_x_continuous(breaks = seq(1900, 2015, by = 20)) +
  scale_y_continuous(limits = c(3, 6), breaks = seq(3, 6, by = 1)) +
  xlab("year") +
  ylab("team runs per game") +
  geom_smooth(span = 0.25) +
  ggtitle("MLB run scoring, 1901-2014") +
  theme(plot.title = element_text(hjust=0, size=16))

MLBRPGplot

MLB run scoring, 1901-2014

Adding a subtitle: the function

So now we have a nice looking dot plot showing the average number of runs scored per game for the years 1901-2014.

But a popular feature of charts--particularly in magazines--is a subtitle that has a summary of what the chart shows and/or what the author wants to emphasize.

In this case, we could legitimately say something like any of the following:
  • The peak of run scoring in the 2000 season has been followed by a steady drop
  • Teams scored 20% fewer runs in 2015 than in 2000
  • Team run scoring has fallen to just over 4 runs per game from the 2000 peak of 5 runs
  • Run scoring has been falling for 15 years, reversing a 30 year upward trend
I like this last one, drawing attention not only to the recent decline but also the longer trend that started with the low-scoring environment of 1968.

How can we add a subtitle to our chart that does that?

The function Bob Rudis has created quickly and easily allows us to add a subtitle. The following code is taken from his blog post. Note that the code for this function relies on two additional packages, grid and gtable. Other than the package loads, this is a straight copy/paste from Bob's blog post.

library(grid)
library(gtable)

ggplot_with_subtitle <- function(gg, 
                                 label="", 
                                 fontfamily=NULL,
                                 fontsize=10,
                                 hjust=0, vjust=0, 
                                 bottom_margin=5.5,
                                 newpage=is.null(vp),
                                 vp=NULL,
                                 ...) {
 
  if (is.null(fontfamily)) {
    gpr <- gpar(fontsize=fontsize, ...)
  } else {
    gpr <- gpar(fontfamily=fontfamily, fontsize=fontsize, ...)
  }
 
  subtitle <- textGrob(label, x=unit(hjust, "npc"), y=unit(hjust, "npc"), 
                       hjust=hjust, vjust=vjust,
                       gp=gpr)
 
  data <- ggplot_build(gg)
 
  gt <- ggplot_gtable(data)
  gt <- gtable_add_rows(gt, grobHeight(subtitle), 2)
  gt <- gtable_add_grob(gt, subtitle, 3, 4, 3, 4, 8, "off", "subtitle")
  gt <- gtable_add_rows(gt, grid::unit(bottom_margin, "pt"), 3)
 
  if (newpage) grid.newpage()
 
  if (is.null(vp)) {
    grid.draw(gt)
  } else {
    if (is.character(vp)) seekViewport(vp) else pushViewport(vp)
    grid.draw(gt)
    upViewport()
  }
 
  invisible(data)
 
}

Adding a subtitle

Now that we've got the function loaded into our R workspace, the steps are easy:
  • Assign the plot object to the name gg (simply because that's what Bob's code uses)
  • Define the text that we want to be in the subtitle
  • Call the function

# set the name of the current plot object to `gg`
gg <- MLBRPGplot

# define the subtitle text
subtitle <- 
  "Run scoring has been falling for 15 years, reversing a 30 year upward trend"
 
ggplot_with_subtitle(gg, subtitle,
                     bottom_margin=20, lineheight=0.9)


MLB run scoring, 1901-2014 with a subtitle


Wasn't that easy? Thanks, Bob!

And it's going to get easier; in the few days since his blog post, Bob has taken this into the ggplot2 development environment, working on the code necessary to add this as a simple extension to the package's already extensive functionality. And Jan Schulz has chimed in, adding the ability to add a text annotation (e.g. the data source) under the plot. It's early days, but it's looking great. (See ggplot2 Pull request #1582.) Thanks, Bob and Jan!

And thanks also to the rest of the ggplot2 developers, for making it possible for those of us who use the package to create good-looking and effective data visualizations. Ain't open development great?

The code for this post (as an R markdown file) can be found in my Bayesball GitHub repo.

-30-

March 6, 2016

Book review: Storytelling With Data

by Cole Nussbaumer Knaflic (2015, Wiley)

The Sabermetric bookshelf, #4

One of the great strengths of R is that there are some robust (and always improving) packages that facilitate great data visualization and tabular summaries. Beyond the capabilities built into the base version of R, packages such as ggplot2 (my favourite), lattice, and vcd and vcdExtra extend the possibilities for rendering charts and graphs, and a similar variety exists for producing tables. And accompanying these packages have been a variety of fine instruction manuals that delineate the code necessary to produce high-quality and reproducible outputs. (You can’t go wrong by starting with Winston Chang’s R Graphics Cookbook, and the R Graph Catalog based on Naomi Robbins’s Creating More Effective Graphs, created and maintained by Joanna Zhao and Jennifer Bryan at the University of British Columbia.)

Let’s call these the “how” resources; once you’ve determined you want a Cleveland plot (sometimes called a “lollipop plot”—please, just stop it), these sources provide the code for that style of chart, including the myriad options available to you.

Elsewhere, there has been a similar explosion in the number of books that build on research and examples as to what makes a good graphic. These are the “what” books; the authors include the aforementioned William Cleveland and Naomi Robbins, and also include Stephen Few and Edward R. Tufte. Also making an appearance are books that codify the “why”, written by the likes of Alberto Cairo and Nathan Yau.

The recently published Storytelling with Data: A Data Visualization Guide for Business Professionals by Cole Nussbaumer Knaflic falls into the latter category, and it’s one of the best I’ve seen to date. Although the subtitle indicates the intended audience, I believe that anyone involved in creating data-driven visualizations would benefit from reading and learning from it.

The book is relatively software agnostic, although Nussbaumer Knaflic recognizes the ubiquity of Excel and has used it to produce the charts in the book. She also provides some sidebar commentary and tools via her website http://www.storytellingwithdata.com/ specifically using Excel. For R users, this shouldn’t pose a particular challenge or barrier; the worst-case scenario is that it provides an opportunity to learn how to use R to replicate the book’s examples.

One of the strengths of the book is that Nussbaumer Knaflic takes the approach of starting with a chart (often a real-life example published elsewhere), and then iterating through one or more options before arriving at the finished product. One instance is the step-by-step decluttering of a line graph, which becomes substantially improved through a six-step process. This example re-appears later, first in the chapter on the use of preattentive attributes and then again in the chapter titled “Think like a designer”. This approach reinforces the second of Nussbaumer Knaflic’s tips that close the book, “iterate and seek feedback”.
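
For R users wanting to follow along, here is a rough ggplot2 sketch of that kind of decluttering (my own illustration with made-up data, not an example taken from the book): drop the default background and legend, and label the series directly.

library(ggplot2)
library(dplyr)

# made-up data: two series to plot and then declutter
df <- tibble::tibble(
  year  = rep(2010:2015, times = 2),
  group = rep(c("A", "B"), each = 6),
  value = c(3, 4, 5, 6, 7, 8, 2, 2, 3, 3, 4, 4)
)

ggplot(df, aes(x = year, y = value, colour = group)) +
  geom_line() +
  # strip the grey background, minor gridlines, and legend
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        legend.position = "none") +
  # label each series directly at the right-hand end of its line
  geom_text(data = filter(df, year == max(year)),
            aes(label = group), hjust = -0.3, show.legend = FALSE)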

Nussbaumer Knaflic also introduces the Gestalt Principles of Visual Perception, and provides vivid examples of how these principles play out in data visualizations.

All of the discussion of graphics is wrapped in the context of storytelling. That is to say, the data visualization is always in the service of making a point about what the data tell us. In the context of business, this then translates into influencing decisions. The chapter “Lessons in storytelling” falls almost exactly in the middle of the book; after we’ve been introduced to the principles of making good data visualizations, Nussbaumer Knaflic gives us a way to think about the purpose of the visualization. With all of the pieces in place, the remainder of the book is focussed on the applications of storytelling with data.

The book is supported with Nussbaumer Knaflic’s site storytellingwithdata.com, which includes her blog. Check out her blog entry/discussion with Stephen Few (“Is there a single right answer?”), and some makeovers (in the Gallery) where she redraws some problematic charts that have appeared in the public domain.

All in all, Cole Nussbaumer Knaflic’s Storytelling with Data is a succinct and focussed book, one that clearly adopts and demonstrates the enormous value of the principles that it espouses. Highly recommended to anyone, of any skill level, who is interested in making effective data visualizations, and the effective use of those visualizations.

Cole Nussbaumer Knaflic’s Storytelling with Data: A Data Visualization Guide for Business Professionals was published in 2015 by Wiley.

Cross-posted at monkmanmh.github.io

References

Note: the authors of the following books have all published additional books on these topics; I’ve simply selected the ones that most closely fit with the context of this review. All are recommended.

  • Alberto Cairo, The Functional Art: An Introduction to Information Graphics and Visualization (2013) New Riders.
  • Winston Chang, R Graphics Cookbook (2012) O’Reilly.
  • William S. Cleveland, Visualizing Data (1993) Hobart Press.
  • Stephen Few, Signal (2015) Analytics Press.
  • Michael Friendly and David Meyer, Discrete Data Analysis with R: Visualization and Modeling Techniques for Categorical and Count Data (2016) CRC Press.
  • Naomi B. Robbins, Creating More Effective Graphs (2004) Reprinted 2013 by Chart House.
  • Edward R. Tufte, The Visual Display of Quantitative Information (2nd edition, 2001) Graphics Press.
  • Hadley Wickham, ggplot2: Elegant Graphics for Data Analysis (2nd edition, 2016) Springer.
  • Nathan Yau, Visualize This: The FlowingData Guide to Design, Visualization, and Statistics (2011) Wiley.