Visualizing your data

In this practice, we’ll be exploring the danish data set again (and you may have to use dplyr as well). While thinking about the questions below, it will be helpful to open R and reproduce all the steps on your own computer.

In the first part (1-5), you will see plots, and will have to think about the code that generated them. In the second part (6-10), you will be given a description of what needs to be plotted, and will again have to come up with the code that creates an appropriate plot. Suggested answers are provided, just as in practices 1 and 2.

Part 1: What code generated these plots?

The scatter plot below shows the relationship between reaction time and uniqueness point for the words in the danish data set (languageR package). In addition, a trend line has been added which shows the fit of a linear regression model. Hint: when you add the line, you will have to set method to "lm". Also, note that the points in the scatter plot have some transparency.

library(dplyr)
library(ggplot2)
library(languageR)

data(danish)

# To make things faster, you can assign a plot to a variable and then add layers to that variable instead of repeating all previous layers. In fact, you could assign just the very first layer (ggplot()) to a variable (which alone won't plot anything) and then add layers to it to make the actual plot (see the alternative below). You'll see why this is useful in question 2.

myPlot = ggplot(data = danish, aes(x = LogUP, y = LogRT)) + 
          geom_point(alpha = 0.2) + 
          geom_smooth(method = "lm") +
          labs(x = "Uniqueness point (log)", y = "Reaction time (log)")

myPlot

# Alternatively:

# myPlot = ggplot(data = danish, aes(x = LogUP, y = LogRT))

# myPlot + geom_point(alpha = 0.2) + geom_smooth(method = "lm") etc.

Now, the same relationship across the levels of the variable Sex, which doesn’t seem to make any difference.

myPlot + facet_grid(~Sex, labeller = "label_both")

# Because myPlot has all the layers from question 1, you just need to add one more layer to produce the plot above. Note the labels in each facet.

The boxplots below compare the reaction times across PrevError, i.e., whether the participant got the previous word right. There is no apparent effect of PrevError on reaction time.

ggplot(data = danish, aes(x = PrevError, y = LogRT)) + 
  geom_boxplot() + 
  labs(x = "Performance on previous trial", y = "Reaction time (log)")

The plot below shows the relationship between affixes and reaction times. Error bars (standard errors) have been added, as well as points to indicate the mean reaction time by affix. Note that the affixes are also ordered by mean reaction time, so you need to take that into account as well. Finally, the background colour is no longer gray.

danish$Affix2 = reorder(danish$Affix, danish$LogRT)

ggplot(data = danish, aes(x = Affix2, y = LogRT)) +
  stat_summary(fun.data = "mean_se", geom = "errorbar") +
  stat_summary(fun = "mean", geom = "point") +   # fun replaces fun.y, which is deprecated since ggplot2 3.3.0
  labs(x = "Affix", y = "Reaction time (log)") +
  theme_bw()
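
To see what reorder() does before applying it to danish, here is a minimal sketch on a toy data set. The affix names and values are made up for illustration and are not taken from danish. By default, reorder() sorts the levels of a factor by the mean of the second argument within each level.

```r
# Toy data (hypothetical values, not taken from danish)
toy = data.frame(
  Affix = factor(c("a", "a", "b", "b", "c", "c")),
  LogRT = c(6.9, 7.1, 6.4, 6.6, 6.7, 6.7)
)

levels(toy$Affix)                      # alphabetical order by default

# reorder() uses the mean of LogRT within each level (FUN = mean by default)
toy$Affix2 = reorder(toy$Affix, toy$LogRT)

levels(toy$Affix2)                     # levels now sorted by mean LogRT
tapply(toy$LogRT, toy$Affix2, mean)    # group means, in ascending order
```

Because ggplot2 plots factor levels in their stored order, using the reordered factor on the x-axis is enough to sort the categories in the plot.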

The plot below shows the relationship between affixes and the mean word frequency (log). Error bars have been added (standard errors). Note that the affixes are also ordered by mean word frequency, so you need to take that into account as well. Don’t worry about the details (e.g., the y-axis ranges from 2 to 7 in the plot; the axis labels are farther from the plot than usual; the x-axis labels are angled; the colours are different, etc.): focus on getting the main aspects of the plot. Then check the code below and compare.

Hint: install the ggthemes package. Then, add theme_fivethirtyeight() to your plot to get the background in the plot below. Then, to use the colours correctly, define the fill and the line colours as follows:

fill = "#4271AE"
line = "#1F3552"

HEX colour codes are easy to look up online (this can be quite useful if you plan to create a template and use specific colours).

Finally, when you create your bars, simply tell R to use fill as the fill colour and line as the line colour. Remember that in ggplot2 the argument for changing the colour of your lines is colour (the spelling color also works). If you find this confusing, don’t worry: it is. As mentioned above, focus on trying to get the bars and the error bars. Then focus on getting the ascending order right. Only worry about the looks at the very end.

library(ggthemes)

fill = "#4271AE"
line = "#1F3552"

danish$Affix3 = reorder(danish$Affix, danish$LogWordFreq)

ggplot(data = danish, aes(x = Affix3, y = LogWordFreq)) + 
  stat_summary(fun = "mean", geom = "bar", alpha = 0.7,   # fun replaces the deprecated fun.y
               width = 0.75, fill = fill, color = line) +
  stat_summary(fun.data = "mean_se", geom = "errorbar", size = 0.4, width = 0.3) +
  theme_fivethirtyeight() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_cartesian(ylim = c(2,7))
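
To make the error bars less of a black box, here is a minimal base R sketch of the quantity that "mean_se" summarises by default: the mean plus or minus one standard error. The vector x holds made-up values, not data from danish.

```r
# Toy vector (hypothetical values)
x = c(4.2, 5.1, 3.8, 4.9, 5.5, 4.4)

m  = mean(x)
se = sd(x) / sqrt(length(x))   # standard error of the mean

# Lower and upper limits of an error bar showing mean +/- 1 standard error
c(lower = m - se, mean = m, upper = m + se)
```

In the plot, this calculation is done once per affix, which is why each bar gets its own error bar.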

Part 2: Transforming to plot

It’s often the case that you need to transform your data before plotting it. You might want to see a pattern that is not directly shown by your data. This is extremely common, and that’s why using dplyr and ggplot2 together is a great option. Some summarisation can be achieved with ggplot2 alone (e.g., using stat_summary()), but depending on the complexity of what you want to plot more work will be necessary. In this part we will practice different types of transformation.

Let’s start with a simple task. Let’s say you want to plot the % of correct answers (PrevError) by speaker. One way to do it is to use a histogram (or a bar plot) to plot the counts, and then add a layer to your plot that transforms counts into percentages (or include the calculation in your plot). As always, there are several different ways to accomplish this. In this case, let’s practice doing that with dplyr and then plotting it.

a. group the data by speaker;
b. calculate the counts;
c. based on (b), calculate the %s;
d. filter by type of answer (remember: you want to plot the % of correct responses);
e. plot using geom_point() (you may want to angle your x-axis labels);
f. finally, order your subjects by % of correct answers.

As usual, do not worry about the details (like how to use % in the y-axis). Only worry about that if everything else is clear.

library(scales) # For % in plot

new = danish %>% 
  group_by(Subject, PrevError) %>% 
  summarise(n = n()) %>%                      # counts
  mutate(Prop = n/sum(n)) %>%                 # %
  filter(PrevError == "CORRECT")              # keep only "CORRECT"

new$Subject2 = reorder(new$Subject, new$Prop) # order subjects by % of "CORRECT"

ggplot(data = new, aes(x = Subject2, y = Prop)) + 
  geom_point() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Subject", y = "% of correct responses\non previous trial") +
  scale_y_continuous(labels = percent_format())
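
As a cross-check on the dplyr steps above, the same proportions can be sketched in base R with table() and prop.table(). The toy data frame below is made up for illustration (two hypothetical subjects); it is not taken from danish.

```r
# Toy data (hypothetical values, not taken from danish)
toy = data.frame(
  Subject   = c("s1", "s1", "s1", "s1", "s2", "s2", "s2", "s2"),
  PrevError = c("CORRECT", "CORRECT", "CORRECT", "ERROR",
                "CORRECT", "CORRECT", "ERROR", "ERROR")
)

counts = table(toy$Subject, toy$PrevError)   # counts by subject and answer
props  = prop.table(counts, margin = 1)      # proportions within each subject (rows)

props
rowSums(props)          # each subject's proportions add up to 1
props[, "CORRECT"]      # proportion of correct answers by subject
```

The key point in both versions is that the proportions are computed within each subject, not over the whole data set.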

Sometimes you just want a histogram to take a look at the distribution of the values you have. How do you generate a simple histogram for LogRT? Then try generating a histogram of the back-transformed values, exp(LogRT), and see how that changes the distribution.


# This is a histogram with LogRT

ggplot(data = danish, aes(x = LogRT)) + 
  geom_histogram(color = "white", bins = 30) + 
  theme_bw()


# This backtransforms LogRT (note the skew)

ggplot(data = danish, aes(x = exp(LogRT))) + 
  geom_histogram(color = "white", bins = 30) + 
  theme_bw()
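
The skew can also be checked numerically. The sketch below uses simulated values rather than danish: the mean, standard deviation, sample size and seed are arbitrary assumptions chosen only to mimic log reaction times.

```r
set.seed(1)                                    # arbitrary seed, for reproducibility
log_rt = rnorm(10000, mean = 6.8, sd = 0.3)    # simulated log reaction times (assumed values)
rt     = exp(log_rt)                           # back-transformed to the original scale

# On the log scale, the mean and the median are close (roughly symmetric)
c(mean = mean(log_rt), median = median(log_rt))

# On the original scale, the long right tail pulls the mean above the median
c(mean = mean(rt), median = median(rt))

# log() undoes exp(), so back-transforming loses no information
all.equal(log(rt), log_rt)
```

A mean that is clearly larger than the median is a quick sign of the right skew that the second histogram shows.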

How do you produce the histogram from question 7 as a density plot?

ggplot(data = danish, aes(x = LogRT)) + 
  geom_density(fill = "darkseagreen", alpha = 0.3) +
  theme_bw()

In question 3, we created a boxplot. Now, do the same using violin plots, but add facets for Sex. Are the data (roughly) normally distributed within each level of PrevError and Sex?

ggplot(data = danish, aes(y = LogRT, x = PrevError)) + 
  geom_violin(alpha = 0.2) +
  facet_grid(~Sex) +
  theme_bw()


# Yes, roughly, but when PrevError == ERROR, there's a considerable asymmetry (skewed distribution). Violin plots are great to quickly check if distributions are normal *across* multiple conditions.

This last question is more complex, so it has an introduction with some instructions.

Intro (not exactly your task, but necessary for plotting later)

You may have noticed that danish does not have a column that tells us whether the participant got the current word right (we only have PrevError). This is unfortunate, so let’s simulate such a column to use it in ggplot. How can we do that?

Easy: R is great for simulation. In this particular case, we want to randomly assign CORRECT and ERROR to our data. Let’s do that first. In the code below, we are telling R to randomly sample CORRECT and ERROR x number of times (where x is the number of rows in danish). This is sampling with replacement, given that we only have two possible values, so we set replace to TRUE. Finally, we can even tell R the probabilities of each level in this factor (note that R will treat this as a character column, so we add as.factor() to the code below). In this case, we want to simulate these values such that CORRECT is more likely than ERROR. More specifically, there’s a 70% probability of randomly generating a CORRECT and a 30% probability of generating an ERROR.


danish$CurrError = as.factor(sample(c("CORRECT", "ERROR"), 
                              size = nrow(danish), 
                              replace = TRUE, 
                              prob = c(0.7, 0.3)))

# Let's now see if the probabilities are mirrored in the simulated data:

danish %>% group_by(CurrError) %>% 
  summarise(n = n()) %>% 
  mutate(Freq = n / sum(n)) %>% 
  select(-n)
## # A tibble: 2 x 2
##   CurrError      Freq
##      <fctr>     <dbl>
## 1   CORRECT 0.6975346
## 2     ERROR 0.3024654


# Great. Now let's move on!
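
Because sample() is random, the proportions you get will differ slightly from the output above each time the code is run. Here is a minimal sketch of how set.seed() makes such a simulation reproducible. The seed value and the vector length n are arbitrary choices for this sketch and are not part of the original exercise.

```r
n = 1000   # stands in for nrow(danish); arbitrary value for this sketch

set.seed(123)   # arbitrary seed
a = sample(c("CORRECT", "ERROR"), size = n, replace = TRUE, prob = c(0.7, 0.3))

set.seed(123)   # same seed again
b = sample(c("CORRECT", "ERROR"), size = n, replace = TRUE, prob = c(0.7, 0.3))

identical(a, b)         # TRUE: the same seed gives the same draws
prop.table(table(a))    # proportions should be close to 0.7 and 0.3
```

Calling set.seed() once before creating CurrError would let you (and anyone you share your script with) get exactly the same simulated column every time.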

Now, pretend that these data are real, so we know how participants performed in the current trial.

This is your task

The plot in question 6 is quite informative, but let’s say you want to take into account the variation of correct answers by affix. In the plot below, we see the mean % of correct responses in our simulated current trial column, but we also see the standard error when we take into account the different %s across affixes (in this case, we won’t see much variation in standard errors, given that we have not added any noise by affix to our random data imputation above). This can tell us how much certainty we have regarding each subject’s performance considering the different affixes.

Hint: This type of transformation is complex at first, but you quite often need to do something similar to it. If you have any questions, refer back to the tutorial or email me. The code below also shows you how flexible dplyr can be: we’re doing several things through a pipeline of operations, all at once (including ggplot, which until now was a separate step). Most importantly, study this example and use it as a “template”.

library(tidyr)      # To use the complete() function below
library(ggthemes)   # For aesthetics, just like above
library(scales)     # For % in the y-axis

# You can add your ggplot() to the end of a pipeline of operations. 


danish %>%
  group_by(Subject, Affix, CurrError) %>%
  summarise(n = n()) %>%                          # counts
  mutate(Prop = n/sum(n)) %>%                     # % within each Subject-Affix pair
  ungroup() %>%                                   # ungroup first: recent tidyr versions cannot complete grouping columns
  complete(Subject, Affix, CurrError) %>%         # add missing combinations (as NA)
  filter(CurrError == "CORRECT") %>%              # keep only "CORRECT"
  mutate(Subject2 = reorder(Subject, Prop)) %>%   # order subjects by % of "CORRECT"
  ggplot(aes(x = Subject2, y = Prop)) +           # the data come from the pipeline, so no data argument
  stat_summary(fun.data = "mean_se", geom = "errorbar", width = 0.3, size = 0.5) +
  theme_fivethirtyeight() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(labels = percent_format())


# This is a single pipeline that does several things at once (including the plot). 

# Because we've ordered the subjects, you're inclined to see a pattern there, even though the data were randomly generated.

# The complete() bit above is very important. Try plotting without that line of code and see what happens.

# Note that there's a syntactic break between mutate() and ggplot(): up to ggplot(), the steps are chained with %>%; from ggplot() onwards, you need to use "+" to add layers to your plot. ggvis, another plotting package, uses %>% instead of "+", so its syntax is more consistent with what dplyr uses.
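
To see why complete() matters, here is a minimal sketch on a toy table of counts. The subjects and counts are made up for illustration; in this toy case, one subject never produced an ERROR, so that combination is simply absent from the data.

```r
library(tidyr)

# Toy counts (hypothetical values): subject s2 never produced an ERROR,
# so that combination has no row at all
toy = data.frame(
  Subject   = c("s1", "s1", "s2"),
  CurrError = c("CORRECT", "ERROR", "CORRECT"),
  n         = c(7, 3, 10)
)

nrow(toy)    # 3 rows: the s2/ERROR combination is missing

# complete() adds one row per missing combination, with NA in the other columns
full = complete(toy, Subject, CurrError)

nrow(full)   # 4 rows
full
```

Without this step, combinations that never occur are silently left out rather than showing up as explicit missing values, which can change what ends up in a summary or a plot.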

Intro to data analysis using R: Practice [3]

Guilherme D. Garcia

guilhermegarcia.github.io
