In this practice, we’ll explore the danish data set using dplyr. While working through the questions below, it will be helpful to open R and reproduce all the steps on your own computer.
First, load dplyr, tibble (well, this will be loaded with dplyr anyway) and languageR (which contains the danish data set; read about these data here, p. 20). Then load danish and visualize the first rows of the data.
library(dplyr)
library(tibble)
library(languageR)
data(danish)
# If you now type 'danish' and hit Enter, R will print all 3,000+ rows of the data (R prints up to 10,000 rows when you call a data frame). That is definitely not helpful, so you normally look only at the first rows (e.g., by using head()).
head(danish)
## Subject Word Affix LogRT PC1 PC2 PrevError
## 1 2s14 appetitlig lig 6.454239 0.5438602 -0.2330842 CORRECT
## 2 2s17 appetitlig lig 6.842854 1.1290598 -0.4860440 CORRECT
## 3 2s15 appetitlig lig 6.839958 0.6549460 0.2470086 CORRECT
## 4 2s04 appetitlig lig 6.834507 -1.2602210 -2.4332345 CORRECT
## 5 2s06 appetitlig lig 6.795191 0.1668297 0.4430017 CORRECT
## 6 2s11 appetitlig lig 7.062680 -0.2500449 0.1067010 CORRECT
## Rank Sex ResidSemRating ResidFamSize LogWordFreq LogAffixFreq
## 1 1.1293305 M -1.234894 -1.6035 2.944439 13.40718
## 2 -0.8380413 M -1.234894 -1.6035 2.944439 13.40718
## 3 -0.7572909 M -1.234894 -1.6035 2.944439 13.40718
## 4 0.8503748 M -1.234894 -1.6035 2.944439 13.40718
## 5 0.1015990 F -1.234894 -1.6035 2.944439 13.40718
## 6 0.3732137 M -1.234894 -1.6035 2.944439 13.40718
## LogCUP LogUP LogCUPtoEnd
## 1 6.463029 5.32301 4.304065
## 2 6.463029 5.32301 4.304065
## 3 6.463029 5.32301 4.304065
## 4 6.463029 5.32301 4.304065
## 5 6.463029 5.32301 4.304065
## 6 6.463029 5.32301 4.304065
Now, using as_data_frame(), create a new variable (dan). This will be a tibble equivalent to the data frame danish. Visualize the first rows of dan. What are the main differences you notice between a data frame and a tibble here? Remember: as.data.frame() creates a typical data frame, whereas as_data_frame() creates a tibble (in more recent versions of tibble, as_tibble() is the preferred name for the same function).
dan = as_data_frame(danish)
dan
## # A tibble: 3,326 x 16
## Subject Word Affix LogRT PC1 PC2 PrevError Rank Sex
## <fct> <fct> <fct> <dbl> <dbl> <dbl> <fct> <dbl> <fct>
## 1 2s14 appe… lig 6.45 0.544 -0.233 CORRECT 1.13 M
## 2 2s17 appe… lig 6.84 1.13 -0.486 CORRECT -0.838 M
## 3 2s15 appe… lig 6.84 0.655 0.247 CORRECT -0.757 M
## 4 2s04 appe… lig 6.83 -1.26 -2.43 CORRECT 0.850 M
## 5 2s06 appe… lig 6.80 0.167 0.443 CORRECT 0.102 F
## 6 2s11 appe… lig 7.06 -0.250 0.107 CORRECT 0.373 M
## 7 2s12 appe… lig 7.47 -0.324 -1.50 CORRECT -0.603 F
## 8 2s21 appe… lig 6.93 0.812 -0.130 CORRECT -0.508 F
## 9 2s10 appe… lig 6.77 -0.241 0.522 CORRECT -0.221 M
## 10 2s03 appe… lig 6.71 -0.147 0.118 CORRECT 0.432 M
## # ... with 3,316 more rows, and 7 more variables: ResidSemRating <dbl>,
## # ResidFamSize <dbl>, LogWordFreq <dbl>, LogAffixFreq <dbl>,
## # LogCUP <dbl>, LogUP <dbl>, LogCUPtoEnd <dbl>
Unlike a typical data frame, a tibble prints only the first 10 rows of your data. In addition, it shows only as many columns as fit on the screen (omitted variables are listed at the bottom of the output). As a result, tibbles tend to be easier to look at and work with. Another important difference is that tibble columns tell you the class of each variable: here, we know the first column is a factor without having to call str(). At the top, the tibble also reports the dimensions of the table.
Now, let’s focus on dplyr. The nice thing about this package is that most things you might want to do can be expressed using simple verbs, for example select(), filter(), mutate(), summarise(), count() and arrange(). In addition, functions such as n() and n_distinct() make counting much easier.
Using dplyr, create a subset of dan called new which contains only the following variables: "Subject", "Word", "Affix", "LogRT", "PrevError", "Sex", "LogWordFreq", "LogUP".
new = select(dan, c(Subject, Word, Affix, LogRT, PrevError, Sex, LogWordFreq, LogUP))
# Remember: select() operates on columns; filter() operates on rows.
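To make the distinction concrete, here’s a quick throwaway example (hypothetical toy data, not part of the exercise):

```r
library(dplyr)
library(tibble)

d <- tibble(x = 1:4, y = c("a", "b", "a", "b"))

select(d, x)         # keeps only the column x (4 rows, 1 column)
filter(d, y == "a")  # keeps only the rows where y is "a" (2 rows, 2 columns)
```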
Which dplyr commands would help you answer the following questions?
(a) How many items are there by Affix? How many by Sex? How many speakers are there?
(b) How many correct responses in preceding trials (PrevError) are there by Affix?
(c) What’s the mean reaction time (LogRT) by Sex?
How would you recode (c) above using multiple functions and the %>% operator? Why would you do that…? Hint: start by telling R which data you want the functions to work on.
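The counting questions above can be answered with count() and n_distinct(). A minimal sketch, using a toy tibble in place of new so that the snippet runs on its own:

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for 'new'
toy <- tibble(
  Subject   = c("s1", "s1", "s2", "s2", "s3"),
  Affix     = c("lig", "bar", "lig", "bar", "lig"),
  Sex       = c("M", "M", "F", "F", "M"),
  PrevError = c("CORRECT", "ERROR", "CORRECT", "CORRECT", "ERROR")
)

count(toy, Affix)        # items by Affix
count(toy, Sex)          # items by Sex
n_distinct(toy$Subject)  # number of speakers

# Correct responses in preceding trials, by Affix
toy %>% filter(PrevError == "CORRECT") %>% count(Affix)
```

The same calls work on new; only the numbers change.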
new %>% group_by(Sex) %>% summarise(meanRT = mean(LogRT))
## # A tibble: 2 x 2
## Sex meanRT
## <fct> <dbl>
## 1 F 6.76
## 2 M 6.78
The nice thing about chaining operations with %>% is that you can accomplish numerous things at once, and your code is more readable, given that a fairly intuitive verb is used in each step.
Using the %>% operator again:
(a) What’s the average word frequency (LogWordFreq) by Affix type? Arrange the affixes in your output starting with the one with the highest word frequency.
(b) What’s the standard deviation of LogRT by speaker? Arrange the speakers starting with the one with the lowest standard deviation.
# (a)
new %>% group_by(Affix) %>% summarise(meanWF = mean(LogWordFreq)) %>% arrange(desc(meanWF))
## # A tibble: 16 x 2
## Affix meanWF
## <fct> <dbl>
## 1 et 6.56
## 2 ede 6.24
## 3 en 6.00
## 4 er 5.92
## 5 s 5.72
## 6 lig 5.57
## 7 isk 5.32
## 8 ende 4.65
## 9 som 4.54
## 10 ere 4.37
## 11 hed 4.32
## 12 ning 4.24
## 13 iv 4.20
## 14 est 3.93
## 15 eri 3.51
## 16 bar 3.06
# (b)
new %>% group_by(Subject) %>% summarise(meanSD = sd(LogRT)) %>% arrange(meanSD)
## # A tibble: 22 x 2
## Subject meanSD
## <fct> <dbl>
## 1 2s21 0.122
## 2 2s15 0.128
## 3 2s09 0.129
## 4 2s11 0.131
## 5 2s05 0.146
## 6 2s06 0.155
## 7 2s02 0.161
## 8 2s10 0.164
## 9 2s19 0.164
## 10 2s16 0.169
## # ... with 12 more rows
Do you think people are faster in a lexical decision task when words are more frequent? Let’s create a table with the most common affixes, like (a) in question 6. This time, however, we’ll add an extra column with the mean reaction time by affix; in other words, you only need to adapt the code used above. Also, this time assign your output to a variable so we can refer back to it later. Arrange the table in descending order of mean word frequency, just like above. This sounds much more complicated than it really is.
out = new %>%
group_by(Affix) %>%
summarise(meanWF = mean(LogWordFreq), meanRT = mean(LogRT)) %>%
arrange(desc(meanWF))
out
## # A tibble: 16 x 3
## Affix meanWF meanRT
## <fct> <dbl> <dbl>
## 1 et 6.56 6.76
## 2 ede 6.24 6.74
## 3 en 6.00 6.74
## 4 er 5.92 6.75
## 5 s 5.72 6.80
## 6 lig 5.57 6.73
## 7 isk 5.32 6.71
## 8 ende 4.65 6.85
## 9 som 4.54 6.76
## 10 ere 4.37 6.81
## 11 hed 4.32 6.75
## 12 ning 4.24 6.76
## 13 iv 4.20 6.77
## 14 est 3.93 6.76
## 15 eri 3.51 6.84
## 16 bar 3.06 6.80
Now you could easily compute the correlation between the two columns and compare it to the correlation between the raw frequencies and reaction times. In other words, you could compare the correlation of word frequency and reaction time by affix and then by word. If you do that, you’ll see that the correlation is significant for words, but not quite for affixes at \(\alpha = 0.05\).
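Such a check could be sketched with cor.test() from base R. The numbers below are made-up stand-ins; in practice you would pass out$meanWF and out$meanRT for the by-affix test, and the trial-level LogWordFreq and LogRT columns for the by-word test:

```r
# Made-up values standing in for mean word frequency and mean RT by affix
wf <- c(6.56, 6.24, 6.00, 5.92, 5.72, 4.65, 3.93, 3.06)
rt <- c(6.76, 6.74, 6.74, 6.75, 6.80, 6.85, 6.76, 6.80)

cor.test(wf, rt)  # Pearson correlation; compare the p-value against alpha = 0.05
```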
For some reason you don’t want to look at the affixes lig, ende and est. You also learned that subject 2s10 knew about your experiment, so you prefer to exclude his/her responses.
cleanData = new %>%
filter(Subject != "2s10" & !Affix %in% c("lig", "ende", "est"))
cleanData
## # A tibble: 2,583 x 8
## Subject Word Affix LogRT PrevError Sex LogWordFreq LogUP
## <fct> <fct> <fct> <dbl> <fct> <fct> <dbl> <dbl>
## 1 2s11 arkiv iv 6.72 CORRECT M 5.36 5.61
## 2 2s09 arkiv iv 6.62 CORRECT F 5.36 5.61
## 3 2s13 arkiv iv 6.50 CORRECT F 5.36 5.61
## 4 2s03 arkiv iv 7.49 ERROR M 5.36 5.61
## 5 2s14 arkiv iv 6.58 CORRECT M 5.36 5.61
## 6 2s04 arkiv iv 6.84 CORRECT M 5.36 5.61
## 7 2s02 arkiv iv 6.54 CORRECT F 5.36 5.61
## 8 2s15 arkiv iv 6.71 CORRECT M 5.36 5.61
## 9 2s06 arkiv iv 6.42 CORRECT F 5.36 5.61
## 10 2s08 arkiv iv 6.82 CORRECT F 5.36 5.61
## # ... with 2,573 more rows
Create a new column in new called affixLength: (a) this column will contain the number of letters in each affix. To do that, use str_length() from the stringr package (you can also use nchar(), which doesn’t require extra packages). For example, if you type str_length("horse"), R will print 5. Finally, (b) remove the columns Word and Sex. Do (a) and (b) at the same time.
library(stringr)
new %>% mutate(affixLength = str_length(Affix)) %>% select(-c(Word, Sex))
## # A tibble: 3,326 x 7
## Subject Affix LogRT PrevError LogWordFreq LogUP affixLength
## <fct> <fct> <dbl> <fct> <dbl> <dbl> <int>
## 1 2s14 lig 6.45 CORRECT 2.94 5.32 3
## 2 2s17 lig 6.84 CORRECT 2.94 5.32 3
## 3 2s15 lig 6.84 CORRECT 2.94 5.32 3
## 4 2s04 lig 6.83 CORRECT 2.94 5.32 3
## 5 2s06 lig 6.80 CORRECT 2.94 5.32 3
## 6 2s11 lig 7.06 CORRECT 2.94 5.32 3
## 7 2s12 lig 7.47 CORRECT 2.94 5.32 3
## 8 2s21 lig 6.93 CORRECT 2.94 5.32 3
## 9 2s10 lig 6.77 CORRECT 2.94 5.32 3
## 10 2s03 lig 6.71 CORRECT 2.94 5.32 3
## # ... with 3,316 more rows
Let’s say that for some reason you want to know the percentage of correct answers (on the previous trial) by affix. How could you do that?
new %>%
group_by(Affix, PrevError) %>%
count() %>%
group_by(Affix) %>%
mutate(Freq = n/sum(n))
## # A tibble: 32 x 4
## # Groups: Affix [16]
## Affix PrevError n Freq
## <fct> <fct> <int> <dbl>
## 1 bar CORRECT 204 0.976
## 2 bar ERROR 5 0.0239
## 3 ede CORRECT 208 0.967
## 4 ede ERROR 7 0.0326
## 5 en CORRECT 185 0.969
## 6 en ERROR 6 0.0314
## 7 ende CORRECT 179 0.947
## 8 ende ERROR 10 0.0529
## 9 er CORRECT 206 0.963
## 10 er ERROR 8 0.0374
## # ... with 22 more rows
Make sure you understand why group_by() is used twice in the code above. Important: you may want to use the complete() function from the tidyr package if you have gaps in your data (you’ll read about that in the tutorial at the end of the course).
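As a preview of what complete() does: it turns implicit gaps (combinations that never occur in the data) into explicit rows. A minimal sketch with toy data, assuming tidyr is installed:

```r
library(tidyr)
library(tibble)

# Toy counts: the combination lig/ERROR never occurs
toy <- tibble(
  Affix     = c("bar", "bar", "lig"),
  PrevError = c("CORRECT", "ERROR", "CORRECT"),
  n         = c(204, 5, 180)
)

# complete() adds the missing lig/ERROR row, filling n with 0
complete(toy, Affix, PrevError, fill = list(n = 0))
```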