In this practice, we’ll explore the danish data set using dplyr. While working through the questions below, it will be helpful to open R and reproduce all the steps on your own computer.
First, load dplyr, tibble (well, this will be loaded with dplyr anyway) and languageR (which contains the danish data set; read about these data here, p. 20). Then load danish and visualize the first rows of the data.
library(dplyr)
library(tibble)
library(languageR)
data(danish)
# If you now type 'danish' and hit Enter, R will print all 3,000+ rows of the data (R prints up to 10,000 rows when you call a data frame). That is definitely not helpful, so you normally look only at the first rows (e.g., by using head()).
head(danish)
## Subject Word Affix LogRT PC1 PC2 PrevError
## 1 2s14 appetitlig lig 6.454239 0.5438602 -0.2330842 CORRECT
## 2 2s17 appetitlig lig 6.842854 1.1290598 -0.4860440 CORRECT
## 3 2s15 appetitlig lig 6.839958 0.6549460 0.2470086 CORRECT
## 4 2s04 appetitlig lig 6.834507 -1.2602210 -2.4332345 CORRECT
## 5 2s06 appetitlig lig 6.795191 0.1668297 0.4430017 CORRECT
## 6 2s11 appetitlig lig 7.062680 -0.2500449 0.1067010 CORRECT
## Rank Sex ResidSemRating ResidFamSize LogWordFreq LogAffixFreq
## 1 1.1293305 M -1.234894 -1.6035 2.944439 13.40718
## 2 -0.8380413 M -1.234894 -1.6035 2.944439 13.40718
## 3 -0.7572909 M -1.234894 -1.6035 2.944439 13.40718
## 4 0.8503748 M -1.234894 -1.6035 2.944439 13.40718
## 5 0.1015990 F -1.234894 -1.6035 2.944439 13.40718
## 6 0.3732137 M -1.234894 -1.6035 2.944439 13.40718
## LogCUP LogUP LogCUPtoEnd
## 1 6.463029 5.32301 4.304065
## 2 6.463029 5.32301 4.304065
## 3 6.463029 5.32301 4.304065
## 4 6.463029 5.32301 4.304065
## 5 6.463029 5.32301 4.304065
## 6 6.463029 5.32301 4.304065
Now, using as_data_frame(), create a new variable (dan). This will be a tibble equivalent to the data frame danish. Visualize the first rows of dan. What are the main differences you notice between a data frame and a tibble here? Remember: as.data.frame() creates a typical data frame, whereas as_data_frame() creates a tibble (in more recent versions of tibble, as_tibble() is the preferred name for the same function).
dan = as_data_frame(danish)
dan
## # A tibble: 3,326 x 16
## Subject Word Affix LogRT PC1 PC2 PrevError Rank Sex
## <fct> <fct> <fct> <dbl> <dbl> <dbl> <fct> <dbl> <fct>
## 1 2s14 appe… lig 6.45 0.544 -0.233 CORRECT 1.13 M
## 2 2s17 appe… lig 6.84 1.13 -0.486 CORRECT -0.838 M
## 3 2s15 appe… lig 6.84 0.655 0.247 CORRECT -0.757 M
## 4 2s04 appe… lig 6.83 -1.26 -2.43 CORRECT 0.850 M
## 5 2s06 appe… lig 6.80 0.167 0.443 CORRECT 0.102 F
## 6 2s11 appe… lig 7.06 -0.250 0.107 CORRECT 0.373 M
## 7 2s12 appe… lig 7.47 -0.324 -1.50 CORRECT -0.603 F
## 8 2s21 appe… lig 6.93 0.812 -0.130 CORRECT -0.508 F
## 9 2s10 appe… lig 6.77 -0.241 0.522 CORRECT -0.221 M
## 10 2s03 appe… lig 6.71 -0.147 0.118 CORRECT 0.432 M
## # ... with 3,316 more rows, and 7 more variables: ResidSemRating <dbl>,
## # ResidFamSize <dbl>, LogWordFreq <dbl>, LogAffixFreq <dbl>,
## # LogCUP <dbl>, LogUP <dbl>, LogCUPtoEnd <dbl>
Unlike a typical data frame, a tibble prints only the first 10 rows of your data. In addition, it shows only as many columns as fit on the screen (omitted variables are listed at the bottom of the output). As a result, tibbles tend to be easier to look at and work with. Another important difference is that tibble columns tell you the class of each variable: here, we know the first column is a factor without having to call str(). At the top, the tibble also reports the dimensions of the table.
Now, let’s focus on dplyr. The nice thing about this package is that most things you might want to do can be expressed using simple verbs, for example select(), filter(), mutate(), summarise(), count() and arrange(). In addition, functions such as n() and n_distinct() make counting much easier.
Using dplyr, create a subset of dan called new which contains only the following variables: "Subject", "Word", "Affix", "LogRT", "PrevError", "Sex", "LogWordFreq", "LogUP".
new = select(dan, c(Subject, Word, Affix, LogRT, PrevError, Sex, LogWordFreq, LogUP))
# Remember: select() operates on columns; filter() operates on rows.
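To make the distinction concrete, here’s a quick throwaway example (hypothetical toy data, not part of the exercise):

```r
library(dplyr)
library(tibble)

d <- tibble(x = 1:4, y = c("a", "b", "a", "b"))

select(d, x)         # keeps only the column x (4 rows, 1 column)
filter(d, y == "a")  # keeps only the rows where y is "a" (2 rows, 2 columns)
```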
Which dplyr commands would help you answer the following questions?
(a) How many items are there by Affix? How many by Sex? How many speakers are there?
(b) How many correct responses in preceding trials (PrevError) are there by Affix?
(c) What’s the mean reaction time (LogRT) by Sex?
How would you recode (c) above using multiple functions and the %>% operator? Why would you do that…? Hint: start by telling R which data you want the functions to work on.
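The counting questions above can be answered with count() and n_distinct(). A minimal sketch, using a toy tibble in place of new so that the snippet runs on its own:

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for 'new'
toy <- tibble(
  Subject   = c("s1", "s1", "s2", "s2", "s3"),
  Affix     = c("lig", "bar", "lig", "bar", "lig"),
  Sex       = c("M", "M", "F", "F", "M"),
  PrevError = c("CORRECT", "ERROR", "CORRECT", "CORRECT", "ERROR")
)

count(toy, Affix)        # items by Affix
count(toy, Sex)          # items by Sex
n_distinct(toy$Subject)  # number of speakers

# Correct responses in preceding trials, by Affix
toy %>% filter(PrevError == "CORRECT") %>% count(Affix)
```

The same calls work on new; only the numbers change.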
new %>% group_by(Sex) %>% summarise(meanRT = mean(LogRT))
## # A tibble: 2 x 2
## Sex meanRT
## <fct> <dbl>
## 1 F 6.76
## 2 M 6.78
The nice thing about chaining operations with %>% is that you can accomplish numerous things at once, and your code is more readable, given that a fairly intuitive verb is used in each step.
Using the %>% operator again:
(a) What’s the average word frequency (LogWordFreq) by Affix type? Arrange the affixes in your output starting with the one with the highest word frequency.
(b) What’s the standard deviation of LogRT by speaker? Arrange the speakers starting with the one with the lowest standard deviation.
# (a)
new %>% group_by(Affix) %>% summarise(meanWF = mean(LogWordFreq)) %>% arrange(desc(meanWF))
## # A tibble: 16 x 2
## Affix meanWF
## <fct> <dbl>
## 1 et 6.56
## 2 ede 6.24
## 3 en 6.00
## 4 er 5.92
## 5 s 5.72
## 6 lig 5.57
## 7 isk 5.32
## 8 ende 4.65
## 9 som 4.54
## 10 ere 4.37
## 11 hed 4.32
## 12 ning 4.24
## 13 iv 4.20
## 14 est 3.93
## 15 eri 3.51
## 16 bar 3.06
# (b)
new %>% group_by(Subject) %>% summarise(meanSD = sd(LogRT)) %>% arrange(meanSD)
## # A tibble: 22 x 2
## Subject meanSD
## <fct> <dbl>
## 1 2s21 0.122
## 2 2s15 0.128
## 3 2s09 0.129
## 4 2s11 0.131
## 5 2s05 0.146
## 6 2s06 0.155
## 7 2s02 0.161
## 8 2s10 0.164
## 9 2s19 0.164
## 10 2s16 0.169
## # ... with 12 more rows
Do you think people are faster in a lexical decision task when words are more frequent? Let’s create a table with the most common affixes, like (a) in question 6. This time, however, we’ll add an extra column with the mean reaction time by affix; in other words, you only need to adapt the code used above. Also, this time assign your output to a variable so we can refer back to it later. Arrange the table in descending order of mean word frequency, just like above. This sounds much more complicated than it really is.
out = new %>%
group_by(Affix) %>%
summarise(meanWF = mean(LogWordFreq), meanRT = mean(LogRT)) %>%
arrange(desc(meanWF))
out
## # A tibble: 16 x 3
## Affix meanWF meanRT
## <fct> <dbl> <dbl>
## 1 et 6.56 6.76
## 2 ede 6.24 6.74
## 3 en 6.00 6.74
## 4 er 5.92 6.75
## 5 s 5.72 6.80
## 6 lig 5.57 6.73
## 7 isk 5.32 6.71
## 8 ende 4.65 6.85
## 9 som 4.54 6.76
## 10 ere 4.37 6.81
## 11 hed 4.32 6.75
## 12 ning 4.24 6.76
## 13 iv 4.20 6.77
## 14 est 3.93 6.76
## 15 eri 3.51 6.84
## 16 bar 3.06 6.80
Now you could easily compute the correlation between the two columns and compare it to the correlation between the raw frequencies and reaction times. In other words, you could compare the correlation of word frequency and reaction time by affix and then by word. If you do that, you’ll see that the correlation is significant for words, but not quite for affixes at \(\alpha = 0.05\).
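Such a check could be sketched with cor.test() from base R. The numbers below are made-up stand-ins; in practice you would pass out$meanWF and out$meanRT for the by-affix test, and the trial-level LogWordFreq and LogRT columns for the by-word test:

```r
# Made-up values standing in for mean word frequency and mean RT by affix
wf <- c(6.56, 6.24, 6.00, 5.92, 5.72, 4.65, 3.93, 3.06)
rt <- c(6.76, 6.74, 6.74, 6.75, 6.80, 6.85, 6.76, 6.80)

cor.test(wf, rt)  # Pearson correlation; compare the p-value against alpha = 0.05
```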
For some reason you don’t want to look at the affixes lig, ende and est. You also learned that subject 2s10 knew about your experiment, so you prefer to exclude his/her responses.
cleanData = new %>%
filter(Subject != "2s10" & !Affix %in% c("lig", "ende", "est"))
cleanData
## # A tibble: 2,583 x 8
## Subject Word Affix LogRT PrevError Sex LogWordFreq LogUP
## <fct> <fct> <fct> <dbl> <fct> <fct> <dbl> <dbl>
## 1 2s11 arkiv iv 6.72 CORRECT M 5.36 5.61
## 2 2s09 arkiv iv 6.62 CORRECT F 5.36 5.61
## 3 2s13 arkiv iv 6.50 CORRECT F 5.36 5.61
## 4 2s03 arkiv iv 7.49 ERROR M 5.36 5.61
## 5 2s14 arkiv iv 6.58 CORRECT M 5.36 5.61
## 6 2s04 arkiv iv 6.84 CORRECT M 5.36 5.61
## 7 2s02 arkiv iv 6.54 CORRECT F 5.36 5.61
## 8 2s15 arkiv iv 6.71 CORRECT M 5.36 5.61
## 9 2s06 arkiv iv 6.42 CORRECT F 5.36 5.61
## 10 2s08 arkiv iv 6.82 CORRECT F 5.36 5.61
## # ... with 2,573 more rows
Create a new column in new called affixLength: (a) this column will contain the number of letters in each affix. To do that, use str_length() from the stringr package (you can also use nchar(), which doesn’t require extra packages). For example, if you type str_length("horse"), R will print 5. Finally, (b) remove the columns Word and Sex. Do (a) and (b) at the same time.
library(stringr)
new %>% mutate(affixLength = str_length(Affix)) %>% select(-c(Word, Sex))
## # A tibble: 3,326 x 7
## Subject Affix LogRT PrevError LogWordFreq LogUP affixLength
## <fct> <fct> <dbl> <fct> <dbl> <dbl> <int>
## 1 2s14 lig 6.45 CORRECT 2.94 5.32 3
## 2 2s17 lig 6.84 CORRECT 2.94 5.32 3
## 3 2s15 lig 6.84 CORRECT 2.94 5.32 3
## 4 2s04 lig 6.83 CORRECT 2.94 5.32 3
## 5 2s06 lig 6.80 CORRECT 2.94 5.32 3
## 6 2s11 lig 7.06 CORRECT 2.94 5.32 3
## 7 2s12 lig 7.47 CORRECT 2.94 5.32 3
## 8 2s21 lig 6.93 CORRECT 2.94 5.32 3
## 9 2s10 lig 6.77 CORRECT 2.94 5.32 3
## 10 2s03 lig 6.71 CORRECT 2.94 5.32 3
## # ... with 3,316 more rows
Let’s say that for some reason you want to know the percentage of correct answers (on the previous trial) by affix. How could you do that?
new %>%
group_by(Affix, PrevError) %>%
count() %>%
group_by(Affix) %>%
mutate(Freq = n/sum(n))
## # A tibble: 32 x 4
## # Groups: Affix [16]
## Affix PrevError n Freq
## <fct> <fct> <int> <dbl>
## 1 bar CORRECT 204 0.976
## 2 bar ERROR 5 0.0239
## 3 ede CORRECT 208 0.967
## 4 ede ERROR 7 0.0326
## 5 en CORRECT 185 0.969
## 6 en ERROR 6 0.0314
## 7 ende CORRECT 179 0.947
## 8 ende ERROR 10 0.0529
## 9 er CORRECT 206 0.963
## 10 er ERROR 8 0.0374
## # ... with 22 more rows
Make sure you understand why group_by() is used twice in the code above. Important: you may want to use the complete() function from the tidyr package if you have gaps in your data (you’ll read about that in the tutorial at the end of the course).
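As a preview of what complete() does: it turns implicit gaps (combinations that never occur in the data) into explicit rows. A minimal sketch with toy data, assuming tidyr is installed:

```r
library(tidyr)
library(tibble)

# Toy counts: the combination lig/ERROR never occurs
toy <- tibble(
  Affix     = c("bar", "bar", "lig"),
  PrevError = c("CORRECT", "ERROR", "CORRECT"),
  n         = c(204, 5, 180)
)

# complete() adds the missing lig/ERROR row, filling n with 0
complete(toy, Affix, PrevError, fill = list(n = 0))
```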