Exploring your data

In this practice, we’ll be exploring the danish data set using dplyr. While thinking about the questions below, it will be helpful to open R and reproduce all the steps on your own computer.

  1. First, load dplyr, tibble (this will be loaded with dplyr anyway) and languageR (which contains the danish data set; read about these data here, p. 20). Then load danish and visualize the first rows of the data.
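One way to do this (a minimal sketch; tibble is attached explicitly here just to be safe):

```r
library(dplyr)
library(tibble)    # loaded with dplyr anyway, but being explicit doesn't hurt
library(languageR) # contains the danish data set

head(danish)       # languageR lazy-loads its data, so danish is available right away
```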




  2. Now, using as_data_frame(), create a new variable (dan). This will be a tibble equivalent to the data frame danish. Visualize the first rows of dan. What are the main differences you notice between a data frame and a tibble here? Remember: as.data.frame() creates a typical data frame, whereas as_data_frame() creates a tibble.
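A possible answer (note that in recent versions of tibble, as_tibble() is the preferred name for as_data_frame()):

```r
dan <- as_data_frame(danish)
dan  # unlike a data frame, a tibble prints only its first rows, plus dimensions and column types
```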




Now, let’s focus on dplyr. The nice thing about this package is that most of what you might want to do can be expressed with simple verbs: select(), filter(), mutate(), summarise(), count(), arrange(). In addition, helpers such as n() and n_distinct() make counting much easier.
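For a first taste of this verb style, here’s a quick sketch using the danish data (column names as in the data set):

```r
count(danish, Affix)  # number of rows for each affix
summarise(danish, speakers = n_distinct(Subject), items = n())
```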

  3. Using dplyr, create a subset of dan called new which contains only the following variables: "Subject", "Word", "Affix", "LogRT", "PrevError", "Sex", "LogWordFreq", "LogUP".
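A sketch of one solution:

```r
new <- select(dan, Subject, Word, Affix, LogRT, PrevError, Sex, LogWordFreq, LogUP)
head(new)
```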




  4. Which dplyr commands would help you answer the following questions? (A sketch of possible answers follows the list.)
    a. How many items are there by Affix? How many by Sex? How many speakers are there?
    b. How many correct responses on preceding trials (PrevError) are there by Affix?
    c. What is the mean RT (LogRT) by Sex?
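Here’s the sketch, one line per question (meanRT is just a label we chose for the output column):

```r
count(new, Affix)             # (a) items by affix
count(new, Sex)               # (a) items by sex
n_distinct(new$Subject)       # (a) number of speakers
count(new, Affix, PrevError)  # (b) previous-trial accuracy by affix
summarise(group_by(new, Sex), meanRT = mean(LogRT))  # (c) mean LogRT by sex
```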




  5. How would you recode (c) above using multiple functions and the %>% operator? Why would you do that?
    Hint: Start out by telling R which data you want the functions to work on.
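A sketch of the piped version:

```r
new %>%                            # start with the data...
  group_by(Sex) %>%                # ...then group it by sex...
  summarise(meanRT = mean(LogRT))  # ...then compute the mean RT per group
```

The payoff: each step reads left to right, so long chains of operations stay legible, with no nested function calls and no intermediate variables.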




  6. Using the %>% operator again (a sketch of both answers follows this list):
    a. What’s the average word frequency (LogWordFreq) by Affix type? Arrange the affixes in your output starting with the one with the highest word frequency.
    b. What’s the standard deviation of LogRT by speaker? Arrange the speakers starting with the one with the lowest standard deviation.
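In this sketch, meanFreq and sdRT are arbitrary column labels:

```r
# (a) mean word frequency by affix, highest first
new %>%
  group_by(Affix) %>%
  summarise(meanFreq = mean(LogWordFreq)) %>%
  arrange(desc(meanFreq))

# (b) standard deviation of LogRT by speaker, lowest first
new %>%
  group_by(Subject) %>%
  summarise(sdRT = sd(LogRT)) %>%
  arrange(sdRT)
```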




  7. Do you think people are faster in a lexical decision task when words are more frequent? Let’s create a table of affixes like the one in (a) of question 6, but this time with an extra column containing the mean reaction time by affix. In other words, you only need to adapt the code used above. Also, assign your output to a variable this time, so we can refer back to it later. Arrange the table in descending order of mean word frequency, just like above. This sounds much more complicated than it really is.
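A sketch, assuming we call the new variable freqRT (any name will do):

```r
freqRT <- new %>%
  group_by(Affix) %>%
  summarise(meanFreq = mean(LogWordFreq),
            meanRT   = mean(LogRT)) %>%  # the extra column
  arrange(desc(meanFreq))

freqRT
```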




  8. For some reason you don’t want to look at the affixes lig, ende and est. You also learned that subject 2s10 knew about your experiment, so you prefer to exclude his/her responses. How would you filter both out of new?
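One way to do it, assuming the affix levels are spelled exactly lig, ende and est in the data (check with levels(new$Affix)):

```r
new %>%
  filter(!Affix %in% c("lig", "ende", "est"),  # drop the three affixes
         Subject != "2s10")                    # drop the suspect speaker
```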




  9. (a) Create a new column in new called affixLength, containing the number of letters in each affix. To do that, use str_length() from the stringr package; alternatively, use nchar(), which doesn’t require any extra package. For example, if you type str_length("horse"), R will print 5. Then, (b) remove the columns Word and Sex. Do (a) and (b) at the same time.
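A sketch of (a) and (b) in one pipe (the as.character() call guards against Affix being stored as a factor, which nchar() in particular won’t accept):

```r
library(stringr)  # only needed for str_length(); nchar() is base R

new <- new %>%
  mutate(affixLength = str_length(as.character(Affix))) %>%
  select(-Word, -Sex)  # minus signs drop columns
```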




  10. Let’s say that for some reason you want to know the percentage of correct answers (on the previous trial) by affix. How could you do that?
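A sketch, with one important caveat: the label used below for a correct previous trial ("CorrectPrev") is a guess, so check how PrevError is actually coded and adjust accordingly:

```r
levels(new$PrevError)  # inspect the coding of previous-trial accuracy first

new %>%
  group_by(Affix) %>%
  summarise(pctCorrect = mean(PrevError == "CorrectPrev") * 100)  # proportion -> percentage
```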