NOTE: The tutorial below is a little outdated. For a more comprehensive and up-to-date intro to R using the tidyverse family of packages, see my book.
This tutorial provides an introduction to the R language. It assumes no prior background in R or data analysis. Each of our four meetings covers a fundamental aspect of the language. The main objective is to show how you can read your own data, explore patterns of interest and visualize trends. We will also briefly discuss statistical tests and the syntax of regression models. Current version created: April 2019; Last updated: September 2021
Before we start, you might want to take a look at this comparison of different packages for statistical analysis. This article discusses some trends in the data analysis world, and shows that SPSS and SAS have been declining rapidly in recent years.
You can find a complete R programming course in French here.
If you are just starting to learn R, it might be useful to weigh the language’s pros and cons first.
Important: This tutorial summarises the main aspects of what we discuss in class. As a result, this vignette will focus mostly on actual code, not text.
I will add links to other websites or manuals that may be helpful. For starters, here you can find a comprehensive tutorial on R.
Most of the time, doing data analysis involves the four steps below. Here, we will first focus on the basics (Part 1), and then on steps 2, 3 and 4. Extras are also provided at the end of this page.
Topics: variables, matrices, data frames, slice notation, RData, packages, R scripts
Usually, the first thing you want to do when you start an R session is (i) to make sure that your workspace is clean and (ii) to specify your working directory.
ls() # This will show you the objects you have in your workspace
rm(list=ls()) # This will clean the workspace
getwd() # This will tell you what your working directory is
setwd("~/Desktop") # This will set it to my own desktop
Unless you are doing something very simple, you will want to open a new R script (cmd+N on a Mac), which you can then save and re-run in the future. Your entire analysis can be saved in a single R script, which other people can run (reproduce) in the future. Read more about R sessions here.
First and foremost, R can be used as a (powerful) calculator. Below, you can see some examples.
seq(100) # This will print numbers from 1 to 100
seq(from = 1, to = 100, by = 1) # The same thing, but with explicit arguments
3 + 4
5 - 2
10 / 3
2 ** 3 # or 2 ^ 3
10 %% 3 # Modulo: the remainder of 10 / 3 (i.e., 1)
sqrt(10)
log10(100)
3**2 - (3 - 4 / (3 + 1)) * sqrt(148 %% 32)
R can also evaluate TRUE/FALSE statements:
5 > 3
## [1] TRUE
nchar("linguistics") < 10 # Is the number of characters in "linguistics" smaller than 10?
## [1] FALSE
10 == 10 # Is 10 the same as 10? Note that == is different from =
## [1] TRUE
In R, = is used to assign values to variables, just like <- (see below). If you want to state that something is equal to something else, you use == (cf. all.equal() and identical()).
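As a quick toy illustration (my own examples) of why these alternatives exist:
0.1 + 0.2 == 0.3            # FALSE, due to floating-point representation
## [1] FALSE
all.equal(0.1 + 0.2, 0.3)   # TRUE: equal within a small numeric tolerance
## [1] TRUE
identical(c(1, 2), c(1, 2)) # TRUE: exactly the same object
## [1] TRUE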
You can assign values to variables. Your variable name can be (almost) anything. These are three possible ways of assigning a number to a variable called myVariable.
myVariable = 5
myVariable <- 5
5 -> myVariable
If you run myVariable, you’ll get 5. Your workspace now contains one item:
myVariable
## [1] 5
ls()
## [1] "myVariable"
To create a matrix in R, you use the command matrix(). To get help with this command, simply type ?matrix().
myMatrix = matrix(c(1,2,3,4,5,6), nrow=2, ncol=3, byrow = TRUE)
myMatrix
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
In the matrix above, we first tell R which elements we want to add to our matrix, followed by the number of rows and columns and, finally, whether or not we wish to order the elements by row.
You can also assign multiple values to a variable. To do that, you first need to concatenate (or combine) all the values you want to assign using c(). This will create a vector. If you then want to perform an operation over the entire vector, you can treat the vector as a ‘single unit’. Vectorized operations are quite fast.
Vectors are essential. When you load your data into R (e.g., a spreadsheet), most of the time your columns can be treated as vectors.
x = c(1,2,3,4,5,6,7) # x is now a vector
x ** x # This will raise each item in x to its own power (e.g., ..., 7^7)
## [1] 1 4 27 256 3125 46656 823543
Variables can contain other types of elements. For example, you may want to have a variable with different types of word order:
wordOrder = c("SOV", "SVO", "VOS", "others")
In this case, wordOrder contains strings (as opposed to numbers, for example). This vector’s class is therefore character.
class(wordOrder)
## [1] "character"
You can access different elements in a vector using slice notation: vector[n], where n represents the position of the item you’re looking for.
wordOrder[2]
## [1] "SVO"
In addition, you can assign names to your vector elements.
names(wordOrder) # No names yet.
## NULL
names(wordOrder) = c("item1", "item2", "item3", "item4")
names(wordOrder) # New names
## [1] "item1" "item2" "item3" "item4"
wordOrder # Note that now you have items and names
## item1 item2 item3 item4
## "SOV" "SVO" "VOS" "others"
Because you now have names (or labels) for each item in the vector, you can also call an item by its name instead of its position. This is useful if you have several items in your vector.
wordOrder["item2"]
## item2
## "SVO"
You can also have a vector of class logical (TRUEs and FALSEs). However, you cannot have a vector with different types of elements. A vector will coerce its values so that all elements belong to the same class (if possible). For example, if your vector contains numbers and strings, R will coerce the numbers into strings, so that the entire vector belongs to the class character.
mixedVector = c("SOV", 2, "VOS", "others", 7)
mixedVector
## [1] "SOV" "2" "VOS" "others" "7"
class(mixedVector)
## [1] "character"
You can see above that the numbers 2 and 7 are printed as strings (“2” and “7”), and that mixedVector is, in fact, not mixed at all: it’s a vector that contains only strings. If you do want to have different types of elements in your vector, you’ll need a list.
Simply put, lists are more complex vectors (and they are extremely useful). You can mix different types of elements in a single list. In the example below, we have a list that contains strings, numbers, booleans, a vector, a matrix and another list.
myList = list("A nice sentence", "word", 23, TRUE, 1, FALSE, c("A", "vector", "with", "strings"), matrix(c(1,2,3,4,5,6,1,2,3), nrow = 3, ncol = 3), list("Another list", c(44, 23), TRUE)) # Note: 9 values fill the 3x3 matrix; 6 values would trigger a recycling warning
myList
## [[1]]
## [1] "A nice sentence"
##
## [[2]]
## [1] "word"
##
## [[3]]
## [1] 23
##
## [[4]]
## [1] TRUE
##
## [[5]]
## [1] 1
##
## [[6]]
## [1] FALSE
##
## [[7]]
## [1] "A" "vector" "with" "strings"
##
## [[8]]
## [,1] [,2] [,3]
## [1,] 1 4 1
## [2,] 2 5 2
## [3,] 3 6 3
##
## [[9]]
## [[9]][[1]]
## [1] "Another list"
##
## [[9]][[2]]
## [1] 44 23
##
## [[9]][[3]]
## [1] TRUE
You can see that myList has 9 entries, each of which may contain different types of elements. Given that several items are included in myList, it may be a good idea to use line breaks to organise your code. Also, let’s add labels to the entries.
myList = list( # Now each row is an item in your list
entry1 = "A nice sentence",
entry2 = "word",
aNumber = 23,
entry4 = TRUE,
1,
FALSE,
c("A", "vector", "with", "strings"),
matrix(c(1,2,3,4,5,6,1,2,3), nrow = 3, ncol = 3), # 9 values fill the 3x3 matrix
anotherList = list("Another list",
aVector = c(firstNumber = 44, secondNumber = 23), TRUE)
)
myList
## $entry1
## [1] "A nice sentence"
##
## $entry2
## [1] "word"
##
## $aNumber
## [1] 23
##
## $entry4
## [1] TRUE
##
## [[5]]
## [1] 1
##
## [[6]]
## [1] FALSE
##
## [[7]]
## [1] "A" "vector" "with" "strings"
##
## [[8]]
## [,1] [,2] [,3]
## [1,] 1 4 1
## [2,] 2 5 2
## [3,] 3 6 3
##
## $anotherList
## $anotherList[[1]]
## [1] "Another list"
##
## $anotherList$aVector
## firstNumber secondNumber
## 44 23
##
## $anotherList[[3]]
## [1] TRUE
You can use the indices in the list to access particular items. For example:
myList[[9]][[2]][2] # 2nd element of the 2nd item in the 9th entry -> 23
## secondNumber
## 23
myList[[2]] # "word"
## [1] "word"
Since we now have labels, accessing items is much easier:
myList$anotherList$aVector[2]
## secondNumber
## 23
A list is like an encyclopedia, where different entries may contain several types of data (from text to figures). We could assign labels (i.e., names) to our entries, which would make accessing data much easier (it would really be like a dictionary). Let’s imagine you want to create a corpus using lists. Each entry is a word, which in turn has several entries (meanings, pronunciation, word class etc.). This is a very simple example, but it shows a nice thing about lists, namely, that you can access different dimensions of data directly.
myDict = list(
nationalisation = list(
word = "nationalisation",
class = "N",
pronunciation = "ˌnæʃ(ə)nələˈzeɪʃən",
meaning1 = "the transfer of a major branch of industry or commerce from private to state ownership or control",
meaning2 = NA,
year = "Unknown",
derivedForm = matrix(c("nation", "-al", "-iz", "-ation"),
ncol = 4,
dimnames = list(c("word:"), c("n", "adj", "v", "n")))
),
anotherEntry = list(
word = "another entry",
class = "another entry",
pronunciation = "anotherentry",
meaning1 = "another entry",
meaning2 = "another entry",
year = "another entry",
derivedForm = "another entry"
)
)
Now, if you wanted to know the pronunciation of the word nationalisation, you could type:
myDict$nationalisation$pronunciation
## [1] "ˌnæʃ(ə)nələˈzeɪʃən"
If you are planning to use R just to read your data and run some statistical tests, most of the time you will be working with data frames. This is equivalent to your Excel spreadsheet: a table where your columns are different variables. Data frames are not the only class you can use, though. Data tables, for example, are much faster and more flexible, and may be very useful if you have millions and millions of rows—which is usually not the case in Linguistics. I will focus on data frames.
You can create a data frame using the command data.frame(). Normally, if you want to create factor levels in a data frame from scratch, you’ll use commands such as gl() to generate levels. Here, I will keep things simple and create a very short data frame.
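As a quick aside (a toy example of my own; we won’t need gl() below), this is what gl() does:
gl(2, 3, labels = c("old", "young")) # 2 levels, each repeated 3 times
## [1] old   old   old   young young young
## Levels: old young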
You will usually import your data into R, so you won’t create a data frame from scratch. However, it’s worth looking into it as an exercise to learn how you could in principle manipulate other objects of interest.
myDataFrame = data.frame(col1 = c(2,4,1,3,8), col2 = c("target", "filler", "filler", "target", "filler"))
myDataFrame
## # A tibble: 5 × 2
## col1 col2
## <dbl> <chr>
## 1 2 target
## 2 4 filler
## 3 1 filler
## 4 3 target
## 5 8 filler
If you want to access a column in your data, you use $. For example, d$AgeSubject is a vector that contains all the information in the AgeSubject column inside d. If you simply type AgeSubject, R won’t be able to find it, since there’s no object in your workspace with that name, i.e., it’s a column inside another object. You could, however, attach() your data frame d, in which case R would assume that AgeSubject means d$AgeSubject. This seems like a good idea, but it may not help you if you’re analysing several data frames at once.
You can also transform an object into a data frame with the command as.data.frame(). Here, you’ll see that R will create column names for you.
myDataFrame2 = as.data.frame(myList[[8]]) # Let's use the matrix in the list above
myDataFrame2
## # A tibble: 3 × 3
## V1 V2 V3
## <dbl> <dbl> <dbl>
## 1 1 4 1
## 2 2 5 2
## 3 3 6 3
Both columns and rows can be manually named in a data frame.
colnames(myDataFrame2) = c("col1", "col2", "col3")
rownames(myDataFrame2) = c("row1", "row2", "row3")
myDataFrame2
## # A tibble: 3 × 3
## col1 col2 col3
## <dbl> <dbl> <dbl>
## 1 1 4 1
## 2 2 5 2
## 3 3 6 3
Data frames traditionally have two dimensions (lists can have many, as you can embed matrices and other lists; data tables also allow for multiple dimensions). As a result, you can access any item (cell) as long as you know its column and row. (Nested data frames can have more dimensions, and are incredibly useful. See Extras at the bottom of this page).
myDataFrame2[1,] # This will print the first row and all columns
myDataFrame2[,1] # All rows in the first column
myDataFrame2[1,1] # First row, first column
myDataFrame2[,] # Everything; equivalent to myDataFrame2
First, let’s create another data frame. This time, let’s have four columns: one for speaker ID, another for condition, another for response and a final one for reaction time. You’ll notice I use line breaks below, and that the contents of the function data.frame() are aligned. This makes things easier when you are reading code, and is good practice.
df = data.frame(
speaker = c(1,1,2,2,1),
condition = c("filler", "target", "target", "filler", "filler"),
response = c(0,0,1,0,1),
RT = c(1.232, 0.987, 1.333, 3.212, 0.765)
)
df
## # A tibble: 5 × 4
## speaker condition response RT
## <dbl> <chr> <dbl> <dbl>
## 1 1 filler 0 1.23
## 2 1 target 0 0.987
## 3 2 target 1 1.33
## 4 2 filler 0 3.21
## 5 1 filler 1 0.765
In this case, df is quite small, and you can easily inspect the entire data frame. However, most of the time your data frames will be much larger. In those cases, you don’t want to simply print your data. Rather, you want to summarise the columns and, perhaps, take a look at the first rows to see if everything is in order. You can do this with summary() and head().
summary(df)
## speaker condition response RT
## Min. :1.0 Length:5 Min. :0.0 Min. :0.765
## 1st Qu.:1.0 Class :character 1st Qu.:0.0 1st Qu.:0.987
## Median :1.0 Mode :character Median :0.0 Median :1.232
## Mean :1.4 Mean :0.4 Mean :1.506
## 3rd Qu.:2.0 3rd Qu.:1.0 3rd Qu.:1.333
## Max. :2.0 Max. :1.0 Max. :3.212
head(df)
## # A tibble: 5 × 4
## speaker condition response RT
## <dbl> <chr> <dbl> <dbl>
## 1 1 filler 0 1.23
## 2 1 target 0 0.987
## 3 2 target 1 1.33
## 4 2 filler 0 3.21
## 5 1 filler 1 0.765
You probably noticed that R is treating both speaker and response as numbers. As a result, summary(df) is telling you that the mean of speaker is 1.4. Naturally, you want to treat speaker and response as factors, but R doesn’t know that. To change these variables (columns) into factors, you use the function as.factor(). Alternatively, when you want to change the class of multiple columns in the same data frame, you can use the function transform()—this is what I do below.
# df$speaker = as.factor(df$speaker) # One alternative
# df$response = as.factor(df$response)
df = transform(df, # This is more concise/less repetitive
speaker = as.factor(speaker),
response = as.factor(response)
)
str(df)
## 'data.frame': 5 obs. of 4 variables:
## $ speaker : Factor w/ 2 levels "1","2": 1 1 2 2 1
## $ condition: chr "filler" "target" "target" "filler" ...
## $ response : Factor w/ 2 levels "0","1": 1 1 2 1 2
## $ RT : num 1.232 0.987 1.333 3.212 0.765
Now that the variables in df are correctly coded, we could proceed with our hypothetical analysis.
To create a column in your data frame:
# This will manually add three types of proficiency
df$proficiency = c("adv", "int", "int", "adv", "beg")
# Alternatively
df["proficiency"] = c("adv", "int", "int", "adv", "beg")
To export df, you first need to decide which format you want to use. I will assume you use csv, possibly the most common format in these cases. However, you can export your data in virtually any format available. Read more here or here (more advanced).
write.csv(df, file = "df.csv", row.names = FALSE)
This will export df as df.csv, and will omit row names (numbered rows). If you don’t omit row names, you’ll see an extra column added to your file when you open it in Excel, for example. This extra column will contain row numbers. Finally, note that df.csv will be saved in your working directory, which in this case is my desktop (setwd() above).
Let’s assume you start a new R session and you want to import some data. Because we saved df.csv in the previous step, let’s import that particular file. First, however, I will clean my workspace.
rm(list=ls()) # Clean workspace
ls() # Make sure no object is loaded
## character(0)
df = read.csv("df.csv") # Import df.csv
ls() # Checking that df is the only thing in the workspace
## [1] "df"
Now, let’s take a look at the class of each variable in df, like we did above.
str(df)
## 'data.frame': 5 obs. of 5 variables:
## $ speaker : int 1 1 2 2 1
## $ condition : chr "filler" "target" "target" "filler" ...
## $ response : int 0 0 1 0 1
## $ RT : num 1.232 0.987 1.333 3.212 0.765
## $ proficiency: chr "adv" "int" "int" "adv" ...
You will notice that both speaker and response are treated as integer by R, even though we did transform these variables before exporting df. The problem here is that csv files store data in plain text. As a result, these files have no ‘memory’, i.e., no metadata. Consequently, every time you load a csv file, R will assume the default classes for all your variables, regardless of what you had previously transformed. This is not a problem if you run a script every time you load your data, of course. However, a better solution is to export your data using the RData format (read more about it here).
RData vs. csv
Let’s transform the variables in df again, then export the data frame as myData.RData.
df = transform(df,
speaker = as.factor(speaker),
response = as.factor(response)
)
save(df, file = "myData.RData")
Notice that myData.RData contains one object, namely, df. However, you can save as many objects as you want. For example, save(obj1, obj2, obj3, ..., objn, file = "allData.RData") will save one single file that contains several objects. This is extremely useful because you can save your models’ outputs as well as your data objects, for instance, which in turn means you don’t have to re-run models when you start a new session.
Now, let’s clean our workspace, load myData.RData, and check the class of our variables.
rm(list=ls())
load("myData.RData") # This will load df, which is the only object in myData.RData
str(df)
## 'data.frame': 5 obs. of 5 variables:
## $ speaker : Factor w/ 2 levels "1","2": 1 1 2 2 1
## $ condition : chr "filler" "target" "target" "filler" ...
## $ response : Factor w/ 2 levels "0","1": 1 1 2 1 2
## $ RT : num 1.232 0.987 1.333 3.212 0.765
## $ proficiency: chr "adv" "int" "int" "adv" ...
Now, you can see that the variables in df are correct. RData is a great way to save all your objects in an intelligent way. Another great advantage is that RData files are compressed, which means they will be much smaller than csv files: read this.
There are literally thousands of R packages. If you want to perform a certain task, there is probably at least one package for that. If you can’t find a package, you can create your own—or you can create your own scripts with convoluted functions that perform a series of operations.
To see which packages you have installed, simply type installed.packages(). To install a new package, type install.packages("name_of_package"). You may have to specify the repository from which you wish to install a package: install.packages("name_of_package", repos = "http://cran.r-project.org"). Then, to load the package, you type library(name_of_package).
We will use some very important packages during this workshop, so you may want to install them already. We will explore each of these packages during the workshop. Other packages may be suggested along the way.
install.packages(c("dplyr", "ggplot2", "reshape2", "tidyr", "stringr", "scales", "languageR"),
repos = "http://cran.r-project.org")
Your analysis normally has several steps: for example, importing and cleaning your data, creating summaries, making plots, and running statistical models. You may have a single script (myAnalysis.R) that contains all of these steps. This, however, may not be a good idea. First, these items are doing very different things; second, your script will be very long, which makes it harder to understand which part is doing what. Even if you properly comment everything you do (always a good idea), your file may not be as readable as it could be.
Instead, one thing you can do is have separate R files/scripts for your project (which I’ll call X):
cleanData_X.R: reads your data and cleans it
summary_X.R: summarised data frames and other objects
plots_X.R: code for your plots
models_X.R: statistical models etc.
As a result, there will be a single script for your plots etc. This will make your code more readable when you come back to your analysis in the future.
OK, but these files have to ‘talk to each other’. Here’s the idea:
cleanData_X.R will read your data and prepare it. It will also load the packages you’ll need, set the working directory etc. In other words, it will contain the basic aspects of your analysis. summary_X.R will then load cleanData_X.R. To do that, you run source("cleanData_X.R"). R will then load all the variables that you created in cleanData_X.R but it will not open that file. This is exactly what you want.
plots_X.R will load both cleanData_X.R and summary_X.R. As a result, your script will automatically load all the objects that you may wish to plot (again: without opening the actual files where you prepare the data). The same goes for models_X.R.
The rationale behind this is simple: if you want to replot something, you don’t necessarily want to revisit the entire analysis, from importing the raw data to the actual plots. Instead, you just want to plot things. To do that, you just need to open one file (which only has your plots) and source() the required scripts, as shown below.
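As a concrete sketch, the top of plots_X.R might look like this (file names as above):
source("cleanData_X.R") # loads packages, reads and cleans the data
source("summary_X.R")   # creates the summarised objects
# ... plotting code goes here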
You can also save intermediate data frames as your analysis progresses. For example: let’s say you clean your data frame and add a column. If you don’t want to re-do this every time you open the script, you could save a .csv file with that column and load that file instead. Of course, this may seem useless considering that you simply have to run a couple of lines of code, but if you create your column with a for-loop and your data frame has 200,000 rows, you will certainly not want to wait for the looping to end every single time you open your script.
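A minimal sketch of that workflow (myBigData is a hypothetical object name):
write.csv(myBigData, "myBigData_processed.csv", row.names = FALSE) # run the slow step once
myBigData = read.csv("myBigData_processed.csv") # future sessions start from the file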
The suggestions above may not be very meaningful at this point, but you might want to develop certain organisational skills from the very beginning. If you are new to R, you probably have no experience writing scripts to analyse your own data. The first scripts you produce are usually messy, and eventually you realise that writing clear and organised code is extremely important (for others to understand your analysis, but also for yourself).
Topics: contingency tables, dplyr, data summaries, subsets
First and foremost, before you explore your data, you normally have to tidy it. It’s almost always the case that raw data files are very messy, and you can’t really do much until you’ve organised it in some meaningful way. For example, if you use online surveys, you know the output files are not exactly what you need for plotting, modelling etc.
Here, I will assume you have already cleaned your data and are ready to explore patterns—you can do that with two extremely useful packages: reshape2 and tidyr. Read about them here—also, this vignette has a nice intro to tidyr. You will certainly want to use these packages for organising messy outputs. Also, check the reading list at the end of this tutorial. In particular, take a look at this book.
Let’s load some data. We’ll use the english dataset, which comes with the languageR package. Below, I load the package and the data set, and then assign english to a new variable (d), which will be used as a shortcut.
library(languageR)
data(english)
d = english
The english dataset has several columns (36). You can check the names of all columns by using the names() command.
names(d)
## [1] "RTlexdec" "RTnaming"
## [3] "Familiarity" "Word"
## [5] "AgeSubject" "WordCategory"
## [7] "WrittenFrequency" "WrittenSpokenFrequencyRatio"
## [9] "FamilySize" "DerivationalEntropy"
## [11] "InflectionalEntropy" "NumberSimplexSynsets"
## [13] "NumberComplexSynsets" "LengthInLetters"
## [15] "Ncount" "MeanBigramFrequency"
## [17] "FrequencyInitialDiphone" "ConspelV"
## [19] "ConspelN" "ConphonV"
## [21] "ConphonN" "ConfriendsV"
## [23] "ConfriendsN" "ConffV"
## [25] "ConffN" "ConfbV"
## [27] "ConfbN" "NounFrequency"
## [29] "VerbFrequency" "CV"
## [31] "Obstruent" "Frication"
## [33] "Voice" "FrequencyInitialDiphoneWord"
## [35] "FrequencyInitialDiphoneSyllable" "CorrectLexdec"
Here’s a description of the dataset (provided in the package documentation):
This data set gives mean visual lexical decision latencies and word naming latencies to 2284 monomorphemic English nouns and verbs, averaged for old and young subjects, with various predictor variables.
Note that reaction times have been log-transformed.
To simplify things a bit, let’s reduce the number of columns. First, let’s define which columns are of interest. Then, we subset d and select only those columns.
columns = c("RTlexdec", "Familiarity", "Word", "AgeSubject", "LengthInLetters", "CV", "Obstruent", "Voice", "WordCategory")
d = d[, columns]
head(d)
## # A tibble: 6 × 9
## RTlexdec Familiarity Word AgeSubject LengthInL…¹ CV Obstr…² Voice WordC…³
## <dbl> <dbl> <fct> <fct> <int> <fct> <fct> <fct> <fct>
## 1 6.54 2.37 doe young 3 C obst voic… N
## 2 6.40 4.43 whore young 5 C obst voic… N
## 3 6.30 5.6 stress young 6 C obst voic… N
## 4 6.42 3.87 pork young 4 C obst voic… N
## 5 6.45 3.93 plug young 4 C obst voic… N
## 6 6.53 3.27 prop young 4 C obst voic… N
## # … with abbreviated variable names ¹LengthInLetters, ²Obstruent, ³WordCategory
This is an example of subsetting your data frame, since we’re reducing the number of columns. You can check the size of your data frame in different ways:
ncol(d) # Number of columns
## [1] 9
nrow(d) # Number of rows
## [1] 4568
dim(d) # Dimension (rows x columns)
## [1] 4568 9
As mentioned before, you can also use summary() to have an overview of your data structure, and str() to check the class of each variable.
But often you want to look at a subset of rows. For example, you might want to examine only young speakers’ data. There are different ways of subsetting your data. The one employed above (d = d[, columns]) uses slice notation. Alternatively, you can use the subset() command.
young = subset(d, AgeSubject == "young")
young = d[d$AgeSubject == "young", ]
nrow(young) # Note that this is a subset of d
## [1] 2284
The two methods above are equivalent. You may find one more intuitive than the other, but they’re doing the same thing. When we type d[d$AgeSubject == "young", ], we’re telling R: take d and index it as d[X, ], i.e., return all the rows for which X is TRUE (and all columns). Here, X is the condition d$AgeSubject == "young", so we keep only the rows of d whose AgeSubject column is "young".
Now we have created a new data frame that only contains data from young speakers. This is not a file yet. Likewise, your original file is still intact: all the changes so far are only real within this particular R session, given this script. You could export young as a csv file, though, and then you wouldn’t need to open and run this script again. If you’re coming from SPSS and Excel, it’s important to know the difference between an object (e.g., a data frame) and an actual file that contains an object (e.g., csv files).
You can have more complicated subsets. For example, you might want data from young speakers that only contain words that begin with a vowel. I will use two lines of code to make it more readable.
youngV = d[d$AgeSubject == "young" &
d$CV == "V", ]
nrow(youngV)
## [1] 61
You could keep adding constraints to your subset, either by using & or by using |, which means or (read more about logical operators here). Finally, you may be looking for data that belong to all but a particular group. Instead of asking R for young speakers’ data, you could tell it to give you all data points except for those coming from old speakers. In other words, young is the same as not old in this particular case. The negative of == is !=.
youngV = d[d$AgeSubject == "young" &
           d$CV == "V", ]   # using ==
youngV = d[d$AgeSubject != "old" &
           d$CV != "C", ]   # the same subset, using != (not old = young; not C = V)
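For completeness, here is a sketch using | (the object name youngOrV is mine):
youngOrV = d[d$AgeSubject == "young" | d$CV == "V", ] # young speakers OR vowel-initial words
nrow(youngOrV) # larger than either subset alone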
Because we’re dealing with factors that only contain two levels, this makes little sense. But it may be very useful if you have more levels. For example, you may want to exclude some of the words from the data. Let’s say there’s a group of words you do not want. I’m going to call this group badWords. To do this you will need another important operator: %in%. Let’s see how this works.
badWords = c("trade", "war", "vent") # First, we create a vector with the words
d2 = d[!d$Word %in% badWords, ]
# Alternatively, you could do these two things at once:
d2 = d[!d$Word %in% c("trade", "war", "vent"), ]
# Or perhaps you just don't want "war" in your data:
noWar = d[d$Word != "war", ]
# Which is equivalent to:
noWar = d[!d$Word %in% c("war"), ]
You can see that %in% is quite useful, and its meaning is intuitive. Now, d2 has everything except for those three words. It’s normally a good idea to do things in steps: first, create a vector to hold the words, and then refer back to the vector. This is particularly useful if you want to remove dozens of words, in which case your d[] line would look quite messy.
Finally, you may want to remove NAs from your data. If you want to remove NAs from a particular column, you could type newData = oldData[!is.na(oldData$column), ]. If you wanted to remove any row that has at least one NA in it, you could type newData = oldData[complete.cases(oldData), ], which returns only the rows that have no NAs in them.
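Here’s a toy illustration (oldData is made up for this example):
oldData = data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
oldData[!is.na(oldData$a), ]       # drops row 2 (NA in column a)
oldData[complete.cases(oldData), ] # keeps only row 1 (no NAs anywhere)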
xtabs() and prop.table()
These are two extremely useful commands in R. Both are used to generate contingency tables. Let’s start with some questions you might have about d.
# 1. How many data points are there by subject age?
xtabs(~AgeSubject, data = d) # Which is equivalent to table(d$AgeSubject)
## AgeSubject
## old young
## 2284 2284
# 2. How many data points are there by subject age and by word-initial segment?
xtabs(~AgeSubject + CV, data = d)
## CV
## AgeSubject C V
## old 2223 61
## young 2223 61
# 3. Adding one more dimension: WordCategory
xtabs(~AgeSubject + CV + WordCategory, data = d)
## , , WordCategory = N
##
## CV
## AgeSubject C V
## old 1405 47
## young 1405 47
##
## , , WordCategory = V
##
## CV
## AgeSubject C V
## old 818 14
## young 818 14
You could add more dimensions, but even three categories may be a little confusing to read depending on the number of levels your factors have. Now, we can repeat all three tables above with proportions by using prop.table().
# 1. What proportion of the data is there by subject age?
prop.table(xtabs(~AgeSubject, data = d))
## AgeSubject
## old young
## 0.5 0.5
# 2. What proportion by subject age and by word-initial segment?
prop.table(xtabs(~AgeSubject + CV, data = d))
## CV
## AgeSubject C V
## old 0.48664623 0.01335377
## young 0.48664623 0.01335377
# 3. Adding one more dimension: WordCategory
prop.table(xtabs(~AgeSubject + CV + WordCategory, data = d))
## , , WordCategory = N
##
## CV
## AgeSubject C V
## old 0.307574431 0.010288967
## young 0.307574431 0.010288967
##
## , , WordCategory = V
##
## CV
## AgeSubject C V
## old 0.179071804 0.003064799
## young 0.179071804 0.003064799
Note that the percentage is calculated across all cells in the table. This is not a problem if you’re examining only one dimension, but you may want to change that. To tell prop.table() which dimension to use as reference for percentages, a second argument is used.
prop.table(xtabs(~AgeSubject + CV, data = d)) # All cells add up to 100%
## CV
## AgeSubject C V
## old 0.48664623 0.01335377
## young 0.48664623 0.01335377
prop.table(xtabs(~AgeSubject + CV, data = d), 1) # Rows add up to 100%
## CV
## AgeSubject C V
## old 0.97329247 0.02670753
## young 0.97329247 0.02670753
prop.table(xtabs(~AgeSubject + CV, data = d), 2) # Columns add up to 100%
## CV
## AgeSubject C V
## old 0.5 0.5
## young 0.5 0.5
It’s easy to remember which number to use: count the variables from the left. The first variable is AgeSubject, which means adding 1 to prop.table() will make that particular variable add up to 100%.
prop.table(xtabs(~AgeSubject + CV + WordCategory, data = d), 3) # 100% for WordCategory
## , , WordCategory = N
##
## CV
## AgeSubject C V
## old 0.483815427 0.016184573
## young 0.483815427 0.016184573
##
## , , WordCategory = V
##
## CV
## AgeSubject C V
## old 0.491586538 0.008413462
## young 0.491586538 0.008413462
Finally, you may want to round your proportions. You can do that using round(). The first argument is what you want to round; the second is the number of decimal digits you prefer.
round(prop.table(xtabs(~AgeSubject + CV + WordCategory, data = d), 3), 2)
## , , WordCategory = N
##
## CV
## AgeSubject C V
## old 0.48 0.02
## young 0.48 0.02
##
## , , WordCategory = V
##
## CV
## AgeSubject C V
## old 0.49 0.01
## young 0.49 0.01
This is a good place to break lines of code. Because you have three functions at once, it may appear confusing at first. Using different lines is helpful:
round(
prop.table(
xtabs(~AgeSubject + CV + WordCategory, data = d),
3),
2)
## , , WordCategory = N
##
## CV
## AgeSubject C V
## old 0.48 0.02
## young 0.48 0.02
##
## , , WordCategory = V
##
## CV
## AgeSubject C V
## old 0.49 0.01
## young 0.49 0.01
Note that it’s now much easier to see which function the 2 and the 3 refer to: RStudio automatically aligns lines for you (the same goes for text editors such as Sublime).
Naturally, you can transform your prop.table() into a data frame and then use it to make plots etc.
propData = as.data.frame(round(prop.table(xtabs(~AgeSubject + CV + WordCategory, data = d), 3), 2))
# Which is equivalent to (but messier than):
propData = as.data.frame(
round(
prop.table(
xtabs(~AgeSubject + CV + WordCategory, data = d),
3),
2)
)
# Now let's look at propData:
head(propData)
## # A tibble: 6 × 4
## AgeSubject CV WordCategory Freq
## <fct> <fct> <fct> <dbl>
## 1 old C N 0.48
## 2 young C N 0.48
## 3 old V N 0.02
## 4 young V N 0.02
## 5 old C V 0.49
## 6 young C V 0.49
aggregate()
One nice way to summarise your data and examine particular patterns is to use aggregate(). For example, you might want to have the mean reaction time by word category and subject age.
dAgg = aggregate(RTlexdec ~ WordCategory + AgeSubject, data = d, FUN = mean)
dAgg
## # A tibble: 4 × 3
## WordCategory AgeSubject RTlexdec
## <fct> <fct> <dbl>
## 1 N old 6.66
## 2 V old 6.65
## 3 N young 6.44
## 4 V young 6.43
Read more about this function by typing ?aggregate() in R.
apply()
Another important way to summarise your data is to use the apply() family of functions (read more here, here and here). If you want to apply a particular function to a list, you can use lapply(), as illustrated below.
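For instance, a minimal lapply() example (a toy list of my own):
lapply(list(a = 1:3, b = c(10, 20)), mean)
## $a
## [1] 2
##
## $b
## [1] 15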
For example, you could apply range() to the RTlexdec column by subject age. We’ll do that using tapply(), which applies a given function to each cell in an array. range(), as you may guess, returns the min() and max() values of an array.
tapply(d$RTlexdec, d$AgeSubject, range)
## $old
## [1] 6.403193 7.187808
##
## $young
## [1] 6.205325 6.879150
dplyr
Possibly one of the most useful packages you’ll use to explore your data, dplyr allows you to easily perform different operations all at once. For example, you can use dplyr instead of xtabs (and prop.table) or aggregate.
First, let’s compare dplyr and xtabs.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Using xtabs:
as.data.frame(xtabs(~AgeSubject + CV + WordCategory, data = d))
## # A tibble: 8 × 4
## AgeSubject CV WordCategory Freq
## <fct> <fct> <fct> <int>
## 1 old C N 1405
## 2 young C N 1405
## 3 old V N 47
## 4 young V N 47
## 5 old C V 818
## 6 young C V 818
## 7 old V V 14
## 8 young V V 14
# Using dplyr:
count(group_by(d, AgeSubject, CV, WordCategory))
## # A tibble: 8 × 4
## AgeSubject CV WordCategory n
## <fct> <fct> <fct> <int>
## 1 old C N 1405
## 2 old C V 818
## 3 old V N 47
## 4 old V V 14
## 5 young C N 1405
## 6 young C V 818
## 7 young V N 47
## 8 young V V 14
# Alternatively:
# d %>%
# group_by(AgeSubject, CV, WordCategory) %>%
# count()
Note that the output of dplyr above is already a data frame. As we will see, dplyr can also work in different steps (see %>% below). For example, here is how we would add the proportions to the output above (i.e., the equivalent of prop.table(xtabs(...))):
# Using prop.tables and xtabs
as.data.frame(prop.table(xtabs(~AgeSubject + CV + WordCategory, data = d),3))
## # A tibble: 8 × 4
## AgeSubject CV WordCategory Freq
## <fct> <fct> <fct> <dbl>
## 1 old C N 0.484
## 2 young C N 0.484
## 3 old V N 0.0162
## 4 young V N 0.0162
## 5 old C V 0.492
## 6 young C V 0.492
## 7 old V V 0.00841
## 8 young V V 0.00841
# Using dplyr
count(group_by(d, AgeSubject, CV, WordCategory)) %>%
group_by(WordCategory) %>%
mutate(Freq = n/sum(n))
## # A tibble: 8 × 5
## AgeSubject CV WordCategory n Freq
## <fct> <fct> <fct> <int> <dbl>
## 1 old C N 1405 0.484
## 2 old C V 818 0.492
## 3 old V N 47 0.0162
## 4 old V V 14 0.00841
## 5 young C N 1405 0.484
## 6 young C V 818 0.492
## 7 young V N 47 0.0162
## 8 young V V 14 0.00841
The nice thing about pipelines (%>%) is that you could even add a plot to the end of the sequence. This makes sense, given that you normally transform your data frame so you can plot it (or model it).
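Here is a sketch of that idea, assuming ggplot2 (covered in Part 3) is loaded:
library(ggplot2)
d %>%
  count(AgeSubject, CV) %>%
  ggplot(aes(x = CV, y = n, fill = AgeSubject)) +
  geom_col() # counts by CV and subject age, piped straight into a bar plot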
Here are some of the different functions the package has:
arrange(): orders rows by one or more columns
filter(): subsets rows that meet some condition
group_by(): groups data by certain variables
mutate(): creates a new column
select(): selects columns in your data
summarise(): collapses grouped data into summary values
The name of each function is very intuitive. You could in fact use filter() to subset your data (there are numerous ways of doing the same thing in R). Let’s take a look at some examples.
# First, let's remove (subtract) some columns
dClean = select(d, -c(Voice, Obstruent, Word))
head(dClean)
## # A tibble: 6 × 6
## RTlexdec Familiarity AgeSubject LengthInLetters CV WordCategory
## <dbl> <dbl> <fct> <int> <fct> <fct>
## 1 6.54 2.37 young 3 C N
## 2 6.40 4.43 young 5 C N
## 3 6.30 5.6 young 6 C N
## 4 6.42 3.87 young 4 C N
## 5 6.45 3.93 young 4 C N
## 6 6.53 3.27 young 4 C N
# Arranging a data frame by a column
orderedRT = arrange(dClean, RTlexdec)
head(orderedRT, 10)
## # A tibble: 10 × 6
## RTlexdec Familiarity AgeSubject LengthInLetters CV WordCategory
## <dbl> <dbl> <fct> <int> <fct> <fct>
## 1 6.21 4.33 young 5 C N
## 2 6.21 4.33 young 5 C V
## 3 6.21 3.87 young 5 C V
## 4 6.21 1.3 young 5 C N
## 5 6.22 6.3 young 3 C N
## 6 6.22 5.5 young 3 C N
## 7 6.23 4.9 young 4 C N
## 8 6.23 5.5 young 5 C N
## 9 6.24 4.1 young 4 C V
## 10 6.24 5.17 young 5 C V
# Arranging by two columns
orderedRT2 = arrange(dClean, LengthInLetters, RTlexdec)
head(orderedRT2, 10)
## # A tibble: 10 × 6
## RTlexdec Familiarity AgeSubject LengthInLetters CV WordCategory
## <dbl> <dbl> <fct> <int> <fct> <fct>
## 1 6.26 6.37 young 2 C V
## 2 6.32 6.47 young 2 C N
## 3 6.32 6.47 young 2 C V
## 4 6.61 6.47 old 2 C N
## 5 6.61 6.47 old 2 C V
## 6 6.61 6.37 old 2 C V
## 7 6.22 6.3 young 3 C N
## 8 6.22 5.5 young 3 C N
## 9 6.24 5.53 young 3 V N
## 10 6.24 5.47 young 3 C V
young = filter(dClean, AgeSubject == "young")
head(young)
## # A tibble: 6 × 6
## RTlexdec Familiarity AgeSubject LengthInLetters CV WordCategory
## <dbl> <dbl> <fct> <int> <fct> <fct>
## 1 6.54 2.37 young 3 C N
## 2 6.40 4.43 young 5 C N
## 3 6.30 5.6 young 6 C N
## 4 6.42 3.87 young 4 C N
## 5 6.45 3.93 young 4 C N
## 6 6.53 3.27 young 4 C N
Now, let’s work with some more interesting cases. Let’s say you want to know the mean reaction time by word length and by word category. You also want to know how many items there are in each category. Finally, you want to order the resulting data frame by mean reaction time within the categories of interest.
dSum = dClean %>%
group_by(WordCategory, LengthInLetters) %>%
summarise(meanRT = mean(RTlexdec), n = n()) %>%
arrange(WordCategory, meanRT)
## `summarise()` has grouped output by 'WordCategory'. You can override using the
## `.groups` argument.
dSum
## # A tibble: 12 × 4
## WordCategory LengthInLetters meanRT n
## <fct> <int> <dbl> <int>
## 1 N 2 6.46 2
## 2 N 3 6.54 474
## 3 N 5 6.55 896
## 4 N 4 6.56 1336
## 5 N 6 6.58 186
## 6 N 7 6.60 10
## 7 V 2 6.45 4
## 8 V 4 6.54 684
## 9 V 3 6.54 202
## 10 V 5 6.54 608
## 11 V 6 6.57 152
## 12 V 7 6.59 14
Here’s what the functions above do: group_by() groups the data by WordCategory and LengthInLetters; summarise() creates two new columns, meanRT (the mean RTlexdec per group) and n (the number of items per group); arrange() orders the rows; and %>% chains all these steps together.
The result is exactly what we want: mean reaction times by word category and word length, arranged according to these two variables. You could now use dSum to make plots and further explore patterns of interest.
It’s important to note that the variables will remain grouped inside the pipeline until you ungroup them by using %>% ungroup(). This allows for even more complex sets of operations, since you could have multiple groupings that rely (or not) on previous groupings.
One known tricky aspect of dplyr is that empty categories are dropped. For example, if a given grouping combination has no values in it, it will not be shown in the final data frame. This may have important consequences if you are calculating proportions, for example. To avoid any possible problems, you can explicitly tell dplyr to output complete data frames. You do that by loading another package, tidyr, and adding an extra line to your pipeline. We don’t need that here, but I’ll add it just as an example.
library(tidyr)
dSum = dClean %>%
group_by(WordCategory, LengthInLetters) %>%
summarise(meanRT = mean(RTlexdec), n = n()) %>%
arrange(WordCategory, meanRT) %>%
ungroup() %>%
complete(WordCategory, LengthInLetters) # will add NAs if needed
## `summarise()` has grouped output by 'WordCategory'. You can override using the
## `.groups` argument.
dSum
When you run this, you’ll notice that the output is a ‘tibble’. Tibbles are quite useful to visualize your data: unlike traditional data frames, if you call a tibble, not all rows are printed (up to 10,000 rows are printed when calling a data frame). As well, the columns are collapsed to fit the resolution of your window. To create a tibble, you can use as_data_frame() (using an existing object) or data_frame() (from scratch). To read more about tibbles, check this.
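A quick sketch (note that in current releases these functions live in the tibble package under the newer names as_tibble() and tibble()):
library(tibble)
as_tibble(dClean) # converts the data frame created above into a tibble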
A final example: counting NAs using dplyr. First, let’s create a data frame with some NAs.
df_NAs = data.frame(a = c(1,2,5,NA,3,9,NA),
b = c("word1", "word2", NA, "word3", "word4", "word5", "word6"),
c = c(T,T,F,F,T,F,T),
d = c("A", "B", NA, NA, NA, "E", NA))
Now, let’s count the number of NAs by column:
df_NAs %>%
  summarise_all(~ sum(is.na(.))) # lambda syntax; the older funs() is deprecated
## # A tibble: 1 × 4
## a b c d
## <int> <int> <int> <int>
## 1 2 1 0 4
Make sure to read more about other aspects of dplyr here or here.
tidyr
Ideally, you want your columns to represent variables and your rows to represent observations. Let’s see an example where that is not the case; i.e., the data is in ‘wide’ format.
someData = data.frame(
  a = round(rnorm(10, mean = 8, sd = 0.3), 1),
  b = round(rnorm(10, mean = 6), 1),
  subject = rep(c("speaker1", "speaker2"), each = 5)
)
head(someData)
## # A tibble: 6 × 3
## a b subject
## <dbl> <dbl> <chr>
## 1 8.6 6.4 speaker1
## 2 8.3 6.6 speaker1
## 3 7.8 7.1 speaker1
## 4 8.4 5.7 speaker1
## 5 8.1 5.9 speaker1
## 6 7.5 5.5 speaker2
In the data above, suppose a and b are different tests, which were taken by subjects speaker1 and speaker2. We want a single column for test, and another column for the scores; i.e., we want to go from ‘wide’ format to ‘long’ format. To do that, we can use the gather() function in the tidyr package (already mentioned a couple of times).
library(tidyr)
longData = gather(someData,
key = test,
value = score,
a:b)
longData
## # A tibble: 20 × 3
## subject test score
## <chr> <chr> <dbl>
## 1 speaker1 a 8.6
## 2 speaker1 a 8.3
## 3 speaker1 a 7.8
## 4 speaker1 a 8.4
## 5 speaker1 a 8.1
## 6 speaker2 a 7.5
## 7 speaker2 a 8.1
## 8 speaker2 a 7.5
## 9 speaker2 a 8.1
## 10 speaker2 a 7.7
## 11 speaker1 b 6.4
## 12 speaker1 b 6.6
## 13 speaker1 b 7.1
## 14 speaker1 b 5.7
## 15 speaker1 b 5.9
## 16 speaker2 b 5.5
## 17 speaker2 b 7.7
## 18 speaker2 b 6.1
## 19 speaker2 b 5.3
## 20 speaker2 b 6.4
The opposite of gather() is spread(), but you can read more about that in the documentation. A small sketch follows.
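Here’s a minimal sketch going back from long to wide. spread() needs each row to be uniquely identified, which our toy data lacks, so I first add a made-up observation number (obs):
longData$obs = rep(1:5, times = 4) # hypothetical row id: 5 scores per subject per test
wideAgain = spread(longData, key = test, value = score)
head(wideAgain)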
Topics: exploratory data analysis, ggplot2
Before we start, this is a good post on the typical chart-selection diagrams. Also, this is a very useful post by Andrew Gelman on data visualization.
There are different packages in R for plotting (e.g., rCharts, lattice, ggplot2, plotly). Here, I will focus on static plots, but you should definitely take a look at interactive plots such as plotly and ggvis, as well as Shiny Apps in general. It all depends on which type of plot best serves your analysis. Even though we will be focusing on ggplot2, let’s first take a look at R’s basic plotting functions, which do not require any extra packages (read more here).
Perhaps the simplest type of plot is a histogram. Let’s simulate 1000 numbers that come from a normal distribution and then create a histogram. Below, I use the command set.seed() to ensure that the random numbers generated by rnorm() will always be the same (otherwise, each time you run rnorm() you will get different numbers).
set.seed(5)
myNumbers = rnorm(1000, mean = 5, sd = 1) # Randomly generate 1000 numbers
hist(myNumbers,
main = "This is a histogram", # Plot title
xlab = "Number", ylab = "Density", # Axes labels
freq = FALSE, # To show the prob. density
breaks = 22, # Number of bins*
col = "lightblue") # Colour of bins
#* This number is a suggestion only (see ?hist)
While your plot window is open (i.e., active), you can add ‘layers’ to your plot by running additional lines of code. For example, we can add a density curve to the histogram in question. For colour options in R, check the documentation. Other websites can help, e.g., this one.
hist(myNumbers,
main = "This is a histogram", # Plot title
xlab = "Number", ylab = "Density", # Axes labels
freq = FALSE, # To show the prob. density
breaks = 22, # Number of bins
col = "lightblue") # Colour of bins
curve(dnorm(x, mean = mean(myNumbers), sd = sd(myNumbers)),
add = TRUE, # Add curve to active plot
col = "black",
lwd = 3) # Line width
Other types of plot are generated with functions such as boxplot() and plot()—in fact, if you use plot() with an object, R will try to plot it with no extra information (try using plot() with xtabs(), as sketched below).
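For instance, reusing d and xtabs() from Part 2:
plot(xtabs(~AgeSubject + CV, data = d)) # R picks a mosaic-style plot for a contingency table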
For example, you can create a density plot by typing plot(density(myNumbers)). You can find more information about these types of plots here. Let’s create a simple data frame where two columns represent normally distributed values with different means.
set.seed(10)
newData = data.frame(
col1 = rnorm(1000, mean = 5, sd = 2),
col2 = rnorm(1000, mean = 2, sd = 1)
)
head(newData) # Let's take a look at the first rows
## # A tibble: 6 × 2
## col1 col2
## <dbl> <dbl>
## 1 5.04 3.05
## 2 4.63 2.29
## 3 2.26 2.24
## 4 3.80 2.83
## 5 5.59 1.78
## 6 5.78 2.29
We can take a look at histograms for each column. First, we tell R to open a plot window for two plots. Then, we follow the same procedure shown above.
par(mfrow = c(1,2)) # Organise plots: 1 row, 2 cols
hist(
newData$col1,
freq = FALSE,
col = "white",
xlab = "Column 1",
main = "Plot A",
breaks = 20
)
curve(dnorm(x, mean = mean(newData$col1), sd = sd(newData$col1)),
add = TRUE,
lwd = 3
)
hist(
newData$col2,
freq = FALSE,
col = "gray",
xlab = "Column 2",
main = "Plot B",
breaks = 20
)
curve(dnorm(x, mean = mean(newData$col2), sd = sd(newData$col2)),
add = TRUE,
lwd = 3
)
We can do the same using boxplots.
boxplot(newData$col1, newData$col2, # A very simple boxplot
col = "darkseagreen")
We will return to histograms and boxplots below, where I explore in more detail how to adjust your plots.
Read more: This is an interesting place to visit if you’re already familiar with basic plotting in R. If you prefer basic plotting to ggplot2, this website has nice tips on boxplots in particular (explore the website for other plots)—you can also try this.
The ggplot package
ggplot2 is possibly the most popular package for data visualization in R. Importantly, there is a huge number of tutorials online, and the package’s documentation is excellent. Places like StackOverflow are also great—you will certainly be visiting these websites if you decide to use R.
There are two types of plots you can make using ggplot2 (i.e., two main commands). First, you can use qplot() for simpler plots. Second, you can use ggplot(). We will focus on the latter, but let’s briefly explore qplot().
Let’s load some data. We’ll use the english dataset again (and we’ll select the same columns as last time).
library(languageR)
data(english)
d = english
This is again the description of the dataset (provided in the package documentation):
This data set gives mean visual lexical decision latencies and word naming latencies to 2284 monomorphemic English nouns and verbs, averaged for old and young subjects, with various predictor variables.
columns = c("RTlexdec", "Familiarity", "Word", "AgeSubject", "LengthInLetters", "CV", "Obstruent", "Voice")
d = d[, columns]
Finally, before we actually start plotting, we need to load ggplot2.
library(ggplot2)
qplot()
We will build the example incrementally. The basic syntax is qplot(variable1, variable2, data): we will start with the simplest qplot() possible and then increase its complexity by adding more dimensions.
Let’s plot RTlexdec as a function of Familiarity (and other variables).
qplot(Familiarity, RTlexdec, data = d)
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
We can see that more familiar words yield faster reaction times overall. Now, let’s add colour (conditional on the age of the subjects, which here is a factor with only two levels: old and young)—remember you can type str(d) and summary(d) to check what the data look like.
qplot(Familiarity, RTlexdec, data = d, colour = AgeSubject)
As expected, reaction time is faster among young subjects. What about the number of letters in each word? Let’s add this extra dimension by adjusting the size of each data point according to the length of each word.
qplot(Familiarity, RTlexdec, data = d, colour = AgeSubject, size= LengthInLetters)
This doesn’t look very informative, so let’s drop the size argument for now. In fact, we know nothing about the density of the data, i.e., how much clustering there is. We can add transparency to the data points to solve that.
qplot(Familiarity, RTlexdec, data = d, colour = AgeSubject, alpha = I(0.2))
We could use facets to investigate yet another variable. Let’s look at Obstruent, a factor with two levels (obstr and cont). The question here is whether the sonority of the first phoneme in the word affects reaction time.
qplot(Familiarity, RTlexdec, data = d, colour = AgeSubject, alpha = I(0.2), facets = ~ Obstruent)
If you wanted to compare reaction times across different factor levels using boxplots, you’d add an extra argument to qplot, namely, geom. In this case, geom = "boxplot", but geom can also take vectors, which means you can have multiple plot layers. In this case, let’s examine the influence of Obstruent on reaction time again.
qplot(Obstruent, RTlexdec, data = d, geom = "boxplot")
qplot() can be useful for quick plots, so you may want to use it at times. I will now turn to ggplot(), which is the focus of our discussion on data visualization. To read more about qplot(), visit this website.
ggplot()
ggplot() works with layers (the Ls below).
Let’s see how to reproduce the scatterplot above using ggplot(). We want to create a scatterplot where the x-axis is Familiarity and the y-axis is RTlexdec. We also want to facet by Obstruent and add colour conditional on AgeSubject.
ggplot(data = d, aes(x = Familiarity, y = RTlexdec, colour = AgeSubject)) + # L 1
geom_point(alpha = 0.2) + # L 2
facet_grid(~Obstruent) # L 3
The first layer, L 1, tells R which data should be plotted, what the axes should be, and the conditional colour. Note that information about the axes goes inside aes(), which stands for aesthetics. The second layer tells R the kind of plot we want to make. geom_point() is the typical scatterplot. The alpha parameter is our transparency. Finally, L 3 tells R that we want to facet the data by Obstruent.
Just like geom_boxplot(), many other geom_s are available:
boxplot: geom_boxplot()
histogram: geom_histogram()
bar plot: geom_bar()
scatter plot: geom_point()
spread points: geom_jitter()
lines: geom_line()
violin plots: geom_violin()
density plots: geom_density()
text plots: geom_text()
pie chart: coord_polar()
rug: geom_rug()
… read more here
To get help, always type ?geom_.... You can also annotate your plot by using annotate() (a sketch follows below). Crucially, each ggplot() layer is separated by a plus sign. If you use a different line for each layer, it’s easier to deactivate a particular layer if you need to: you simply use a # in that particular line of code.
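A minimal annotate() sketch (the coordinates and label text are arbitrary choices):
ggplot(data = d, aes(x = Familiarity, y = RTlexdec)) +
  geom_point(alpha = 0.2) +
  annotate("text", x = 6.5, y = 7.1, label = "An arbitrary label")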
In ggplot(), you can control virtually every detail in your plot. The theme() layer is responsible for the overall aesthetics (see the sketch below)—there’s a lot of material about it. Read more here.
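A small theme() sketch (all settings here are arbitrary choices):
ggplot(data = d, aes(x = AgeSubject, y = RTlexdec)) +
  geom_boxplot() +
  theme(axis.title = element_text(size = 14),
        panel.background = element_blank())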
Naturally, because you can add several layers to your plot, you can end up with a plot that makes no sense (or that has too much information and therefore is not very informative). Let’s create such a plot just to understand a bit more about layers.
ggplot(data = d, aes(x = AgeSubject, y = RTlexdec)) +
geom_jitter(color = "brown", alpha = 0.3) +
geom_boxplot() +
geom_violin(alpha = 0.3, fill = "blue")
This is a very ugly plot, of course, and it also provides too much information. However, R will not warn you that this is a weird plot: if something is plottable, R will do it. The nice thing about using separate lines for each layer in your plot is that you could ‘turn off’ some layers very easily (but remember to end each line with + so that R knows your plot isn’t over yet):
ggplot(data = d, aes(x = AgeSubject, y = RTlexdec)) +
#geom_jitter(color = "brown", alpha = 0.3) + This layer is now inactive
#geom_boxplot() +
geom_violin(alpha = 0.3, fill = "blue")
If your data are not normally distributed, using boxplots may be visually misleading (see below). Violin plots, in this case, show the actual distribution of the data points.
Another nice thing about layers is that you can source data from different data frames. When you use geom_point(), R sees no data. It then looks ‘one level up’ and finds ggplot(data=...). As a result, geom_point() inherits whichever data frame you used in the first layer of your plot. However, you could override this by providing a different data frame to geom_point(). In fact, you could have multiple geom_point()s, each of which contains a different data frame. This can be very useful if you are plotting different data frames in a single plot. In sum, you have a lot of flexibility.
ggplot(data = data1, aes(...)) +     # Layer 1
geom_point() +                       # This will use data1
geom_point(data = data2, aes(...)) + # This will use data2
geom_violin() +                      # This will use data1 again
geom_point(data = data3, aes(...))   # You get the idea
Next, let’s make a density plot with a rug at the bottom of the plot. Note that we can also use linebreaks to make a single layer more readable.
ggplot(data = d, aes(x = RTlexdec)) +
geom_density(fill = "blue",
color = "blue", # Line colour
alpha = 0.5) +
geom_rug()
You can also flip the plot (which may or may not make sense). This is yet another layer:
ggplot(data = d, aes(x = RTlexdec)) +
geom_density(fill = "blue",
color = "blue", # Line colour
alpha = 0.5) +
geom_rug() +
coord_flip()
Next, let’s look at geom_boxplot().
ggplot(data = d, aes(x = AgeSubject, y = RTlexdec)) +
geom_boxplot()
Again: because you can have layers, you can add the actual data points to your plot. Let’s do that by using geom_jitter()—but let’s add some transparency. Also, note that the order of the layers matters: if geom_jitter() precedes geom_boxplot(), the points will be behind the boxplot. Here, because we’ll add geom_jitter() after geom_boxplot(), the points will be in front of the boxplots. Finally, note that we’re not using geom_point() because the x-axis is categorical, which means all the points would be stacked vertically.
ggplot(data = d, aes(x = AgeSubject, y = RTlexdec)) +
geom_boxplot() +
geom_jitter(alpha = 0.1) # This won't be very informative
Let's add some colour. There are basically two ways of adding colour to your plot: first, you can simply use a particular colour that plays no particular role. Second, you may want to add colour to represent different levels of a factor, for example, in which case the colour itself is informative. If your colour is conditional on some factor, you should add the colour inside aes(). If you just want things to be colourful, then place the colour outside of aes(). For example:
ggplot(data = d, aes(x = AgeSubject, y = RTlexdec)) +
geom_jitter(alpha = 0.1, color = "darkseagreen")
In the plot above, we’re just using colour for aesthetic reasons. Now look at the plot below.
ggplot(data = d, aes(x = AgeSubject, y = RTlexdec)) +
geom_jitter(alpha = 0.3, aes(colour = Obstruent))
Now, we’re telling R to use colour according to a particular factor—which will automatically generate a legend to the right of the plot. R will choose the colours for you (and it doesn’t do a great job most of the time). As a result, you may want to use your own colours: you can even manually create very specific combinations of colours (read more here). There are also palettes you can use, if you are looking for some pre-set combinations of colours.
ggplot(data = d, aes(x = AgeSubject, y = RTlexdec)) +
geom_jitter(alpha = 0.3, aes(colour = Obstruent)) +
scale_colour_manual(values = c("black", "red"))
Note that we're using scale_colour_manual() because we're setting the values for the colour aesthetic. Imagine we were using bars, and we wanted to fill each bar with a colour that is conditional on some variable. We'd add geom_bar(aes(fill = variable)). In that case, if we wanted to manually set the colour for fill, we would use scale_fill_manual(). There are numerous tutorials on how to use colours in ggplot2 (e.g., this one).
What you could do is define colours that you might like and always use
them. For example, you might want to have plot templates for papers,
more colourful ones for presentations and posters etc. In fact, this is
not only the case for colours: overall, you probably want to spend some
time designing a prototypical plot that satisfies your needs (and
taste), and then use/adapt the template for your analyses.
Naturally, the plots above are not aesthetically pleasing—and don’t make much sense, either. The idea here is to show you what you could do. Let’s use facets now to see whether vowel-initial words yield different reaction times relative to consonant-initial words. I will also further personalise the plot by adding axis labels, a title, and adjusting the width of the boxplot, its fill colour and the plot background.
ggplot(data = d, aes(x = AgeSubject, y = RTlexdec)) +
geom_boxplot(width = 0.5, alpha = 0.3) +
facet_grid(~CV) +
ggtitle("My boxplot\n") + # \n will add a line break
labs(y = "Reaction time\n", x = "\nSubject age") +
theme_bw() # White background
By adding \n to your text, you can easily create more room between the plot and your labels. There are other ways to adjust the spacing, but this is by far the simplest/easiest to remember (it adds exactly one blank line).
Let’s now go back to violin plots. Above, I mentioned that they can be quite useful when your data are not normally distributed. Let’s see an example. First, let’s create a bimodal distribution. We will do that by first creating two normally distributed vectors and then combining them into a single column in a data frame.
x = rnorm(1000, mean = 5, sd = 0.5) # First distribution
y = rnorm(1000, mean = 8, sd = 0.5) # Second distribution
z = data.frame(data=c(x,y), someFactor = gl(1, 2000, 2000, labels = c("Level")))
head(z)
## # A tibble: 6 × 2
## data someFactor
## <dbl> <fct>
## 1 4.47 Level
## 2 4.32 Level
## 3 5.04 Level
## 4 4.55 Level
## 5 6.83 Level
## 6 5.56 Level
The data frame above has two columns, namely, data and someFactor, which has a single level (Level). In other words, we're simulating a group with a bimodal distribution. We can see that it's bimodal by plotting a histogram.
ggplot(data = z, aes(x = data)) +
geom_histogram(bins = 30, colour = "white")
If you now plot these data using a boxplot, they will appear to be normally distributed.
ggplot(data = z, aes(y = data, x = someFactor)) +
geom_boxplot(width = 0.5, alpha = 0.5)
This can obviously be misleading: a boxplot only shows quartile-based summaries, so it hides multimodality. Violin plots do not:
ggplot(data = z, aes(y = data, x = someFactor)) +
geom_violin(width = 0.5, alpha = 0.5)
You may also want to use a bar plot to show percentages, even though this is usually not a great use of space, given that typical bars convey little information. Let's take a look at geom_bar() (which in many cases can give you the same output as geom_histogram()). In this particular case, I'll add % to the y-axis using the scales package.
library(scales)
ggplot(data = d, aes(x = Obstruent, y = after_stat(count / sum(count)))) + # after_stat() replaces the deprecated ..count.. notation
geom_bar(width = 0.5) +
scale_y_continuous(labels = percent_format()) + # To have % in the y-axis
labs(x = "Type of segment", y = "%")
Note that you could convey the exact same thing with two dots, one for each percentage in the plot (even though that would require some extra work on the data).
What if you wanted to compare the mean (with standard error bars) reaction time between voiced and voiceless word-initial segments? Here, we'll use stat_summary(), a very useful function in ggplot(). First, we'll add error bars. Then, we'll also add a square to represent the actual means. As mentioned before, you'll notice that several layers of stat_summary() can be used.
ggplot(data = d, aes(x = Voice, y = RTlexdec)) +
stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.3) + # Error bars
stat_summary(fun = mean, geom = "point", shape = 15, size = 4) # fun replaces the deprecated fun.y argument
By using shape = 15, I'm telling R to plot the means (fun = mean) using squares (check the possible shapes here). Read more about stat_summary() here. If you want to use standard deviations instead of standard errors, simply use mean_sdl instead of mean_se. In addition, you can adjust the multiplier by adding fun.args = list(mult = 1) (the default is 2 standard deviations). Finally, for 95% CIs, you can use mean_cl_normal (assuming normality) or mean_cl_boot.
What if you want to plot the reaction time of different words by using the actual words in the plot…? Let’s plot reaction time by familiarity and add two trend lines (linear and non-linear). However, let’s focus on words with three letters only: notice that you can subset the data in the plot.
ggplot(data = d[d$LengthInLetters == 3, ], aes(x = Familiarity, y = RTlexdec)) +
geom_text(aes(label = Word), alpha = 0.2) +
geom_smooth(colour = "yellow", size = 2, method = "lm") + # Linear trend line
geom_smooth(colour = "blue", size = 2, method = "loess", linetype = "dashed")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
You can omit the standard error shadow if you add se = FALSE to geom_smooth(). Let's annotate this plot.
ggplot(data = d[d$LengthInLetters == 3, ], aes(x = Familiarity, y = RTlexdec)) +
geom_text(aes(label = Word), alpha = 0.2) +
geom_smooth(colour = "yellow", size = 2, method = "lm") + # Linear trend line
geom_smooth(colour = "blue", size = 2, method = "loess", linetype = "dashed") +
annotate("text", x = 4, y = 7, label = "Hello", fontface = "bold", size = 7, color = "darkgreen")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Read more about annotate() here.
People often underestimate how much the appearance of a plot matters, but being able to easily understand what the plot is trying to convey can be extremely important. Plots are more reader-friendly than tables, and it's great if your results can be plotted in a clear and interesting way.
One very basic thing that people often forget about is font size. Depending on how you save your plot in R (see below), and on how large the plot will be in your paper/presentation, font size can change dramatically. If you plotted using a large window and the font size looked OK, it may be too small in your paper.
The most important thing you need to remember is that ggplot() has an entire layer to adjust the looks of your plot. As mentioned above, it's called theme() (read more here). Inside theme() you will specify all the aesthetic details you want. From font face/size and colour to the size of the ticks or the angle of your x-axis labels: everything is done within theme().
To exemplify theme(), I'm going to replot reaction time by voice using error bars. Let's see how different the plot will look.
ggplot(data = d, aes(x = Voice, y = exp(RTlexdec))) + # back-transform y-axis
stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.15, linewidth = 0.35) + # Error bars
stat_summary(fun = mean, geom = "point", size = 3) +
theme_bw() +
theme(text = element_text(size = 12),
plot.title = element_text(hjust = 0.5)) + # Center title
labs(x = "\nVoicing", y = "Reaction time", title = "A plot") + # Labels + title
scale_y_continuous(labels = unit_format(unit = "ms", sep = " ")) # add ms to y-axis
One important thing to remember in this particular case: theme_bw() comes before theme(). In other words, if you want to add a particular theme (from ggthemes; see below), it has to come before theme(). Otherwise, your adjustments within theme() will be ignored.
Several additional parameters can be changed (check the documentation). I personally like to use CMU Sans Serif because I write with CMU Serif in \(\LaTeX\), so text and plots look more harmonious. Before choosing your font, however, you may have to install an extra package, extrafont, which may take a little while to install (read here).
Finally, remember that plots should make things easier, so it may be useful to assume that 'less is more'. You might want to check two packages: ggthemes and cowplot.
In many cases, you will want to order your axis labels based on a given variable (there are many reasons why you might wish to do something like that). Here's one of many options: you can create a new column that reorders your levels based on a different variable. In this example, we'll use reorder().
In this simple example, let's say you want to plot the mean reaction time by Frication, which has four levels. These levels are automatically sorted alphabetically by R. However, you may want to order them based on the mean RT in each case. As a result, your plot will show a nice ascending pattern.
d = english
fricMean = d %>% group_by(Frication) %>% summarise(meanRT = mean(RTlexdec))
fricMean$Frication2 = reorder(fricMean$Frication, fricMean$meanRT)
ggplot(data = fricMean, aes(x=Frication, y=meanRT)) +
geom_point(size = 3) +
theme_bw() +
ggtitle("Frication in alphabetical order") +
labs(x = "Frication", y = "Mean RT")
ggplot(data = fricMean, aes(x=Frication2, y=meanRT)) +
geom_point(size = 3) +
theme_bw() +
ggtitle("Frication ordered by mean RT") +
labs(x = "Frication", y = "Mean RT")
# ALSO:
# You could reorder your levels *in the plot* itself as well, by using the reorder() function within aes(). Example: aes(x=reorder(Frication, meanRT), y = meanRT)
Finally, for plotting MCMC models, you can use the ggmcmc package, which prepares the data (using dplyr) for plotting with ggplot2.
If you are using RStudio, you can click on Export (Save as PDF), tick use Cairo_pdf..., and save it. You can also type pdf("fileName.pdf") before your plot and dev.off() right after your plot: run the first command, then generate your plot, then run the second command. Your plot will be saved as a PDF file in your working directory.
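For example, a minimal sketch (the file name and dimensions here are arbitrary):
pdf("myBoxplot.pdf", width = 6, height = 4) # Open a PDF device (size in inches)
print(ggplot(data = d, aes(x = AgeSubject, y = RTlexdec)) +
geom_boxplot()) # print() ensures the plot is drawn when run from a script
dev.off() # Close the device; the PDF is now in your working directory
For ggplot2 plots specifically, ggsave("myBoxplot.pdf") is a convenient alternative: it saves the last plot you displayed.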
If you are using R, you can also type cmd + s
(on a Mac)
when the plot window is active. This will save your plot as a pdf, but
note that the resolution of your file will be sensitive to the
actual size of the window when you save it. As a result, if you
generate 10 plots and you want them to have the same size, you cannot
alter the size of the window in the process (besides adjusting the
actual code of every plot, of course). This is very practical, as long
as you remember that the window size affects the size of the plot.
In sum: ggplot2 is a very powerful and flexible package for data visualization. You can adjust virtually every aspect of your plots by using layers of code.
Topics: statistical tests, ANOVA, regression, log-odds, odds, probabilities
The idea here is to focus mostly on the syntax of tests and models, and not on the interpretation of the results (as that would require more than the two hours that we have). You will see that running tests and models in R is quite straightforward.
We have already seen some of the most basic stats functions that provide some information about our data. For example, mean(), median(), sd(), min(), max(), range() (if your vector contains NAs, however, you need to add na.rm = TRUE to your functions). Of course, you can also type summary() to get a summary of an object (note that this could be your data, your model, a vector etc.). It's often the case that what you want has a mnemonic function in R, as mentioned above. For example, to print the quantiles of a vector, simply type quantile().
Below, I provide a brief description of some popular statistical tests as well as models you can run in R. First, let's revisit the english dataset.
library(languageR)
data(english)
d = english
columns = c("RTlexdec", "Word", "AgeSubject", "LengthInLetters", "Familiarity", "Frication", "Voice", "WordCategory")
d = d[, columns]
head(d)
## # A tibble: 6 × 8
## RTlexdec Word AgeSubject LengthInLetters Familiarity Frication Voice WordC…¹
## <dbl> <fct> <fct> <int> <dbl> <fct> <fct> <fct>
## 1 6.54 doe young 3 2.37 burst voic… N
## 2 6.40 whore young 5 4.43 frication voic… N
## 3 6.30 stress young 6 5.6 frication voic… N
## 4 6.42 pork young 4 3.87 burst voic… N
## 5 6.45 plug young 4 3.93 burst voic… N
## 6 6.53 prop young 4 3.27 burst voic… N
## # … with abbreviated variable name ¹WordCategory
Statistical tests are usually not very rich in terms of the amount of information their outputs provide. You probably analyse your data using regression models instead, for example. But it’s nice to start with tests to better understand how R syntax works. Let’s briefly take a look at how you could run some tests.
If you know a test, chances are that the R function for that test has a very similar name. For example, if you want to run a t-test, you type t.test(); if you want to run a Chi-squared test of independence, you type chisq.test() (read more about it here). Let's assume you want to check if the reaction time is different between old and young speakers.
t.test(RTlexdec ~ AgeSubject, data = d)
##
## Welch Two Sample t-test
##
## data: RTlexdec by AgeSubject
## t = 67.468, df = 4534.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group old and group young is not equal to 0
## 95 percent confidence interval:
## 0.2152787 0.2281642
## sample estimates:
## mean in group old mean in group young
## 6.660958 6.439237
# Type ?t.test() to see additional arguments for the function
Manually defining groups: when we use t.test(), we normally have two arguments (groups), although you can also run a t-test on a single sample, of course. But if you have multiple groups and only want to test two of them, it is useful to create the subsets manually. The first group is the vector of reaction times that belong to young speakers. To do that, we first subset (d[d$AgeSubject == "young", ]) and then select a column, $RTlexdec (you might want to review subsetting from day 2). This may seem a little confusing, so you could do it in steps.
group1 = d[d$AgeSubject == "young", ] # or d %>% filter(AgeSubject == "young") using dplyr
group2 = d[d$AgeSubject == "old", ] # or d %>% filter(AgeSubject == "old") using dplyr
t.test(group1$RTlexdec, group2$RTlexdec)
##
## Welch Two Sample t-test
##
## data: group1$RTlexdec and group2$RTlexdec
## t = -67.468, df = 4534.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2281642 -0.2152787
## sample estimates:
## mean of x mean of y
## 6.439237 6.660958
By default, t.test() will not assume equal variances between the groups (that's why the degrees of freedom are different from what you'd calculate manually in a typical t-test). If you want to change that, simply add var.equal = TRUE. If you're running a paired t-test, add paired = TRUE to your function. Finally, if your hypothesis is directional, you should specify that by adding alternative = "less" (the default is two.sided, but you can have less or greater). In our data, we already expect older speakers to have slower reaction times.
t.test(RTlexdec ~ AgeSubject, alternative = "greater", data = d)
##
## Welch Two Sample t-test
##
## data: RTlexdec by AgeSubject
## t = 67.468, df = 4534.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group old and group young is greater than 0
## 95 percent confidence interval:
## 0.2163149 Inf
## sample estimates:
## mean in group old mean in group young
## 6.660958 6.439237
You may want to check if two variables are correlated. Below, var1 and var2 are obviously negatively correlated. var3, on the other hand, is simply simulated from a normal distribution.
var1 = seq(from = 0.5, to = 100, by = 0.5) # 200 numbers from 0.5 to 100
var2 = seq(from = 100, to = 0.5, by = -0.5) # 200 numbers from 100 to 0.5
var3 = rnorm(200, mean = 1, sd = 0.1) # 200 random numbers (normally distr)
cor.test(var1, var2)
##
## Pearson's product-moment correlation
##
## data: var1 and var2
## t = -Inf, df = 198, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -1 -1
## sample estimates:
## cor
## -1
cor.test(var1, var3)
##
## Pearson's product-moment correlation
##
## data: var1 and var3
## t = 0.93182, df = 198, p-value = 0.3526
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07333642 0.20295730
## sample estimates:
## cor
## 0.06607691
In our data frame d, you might want to see if reaction time is correlated with familiarity.
cor.test(d$RTlexdec, d$Familiarity)
##
## Pearson's product-moment correlation
##
## data: d$RTlexdec and d$Familiarity
## t = -33.522, df = 4566, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4673865 -0.4208330
## sample estimates:
## cor
## -0.4444097
ANOVAs are run using the aov() function in R. There's also another function, anova(), which is used to compute analysis of variance of one or more fitted models (see below). Let's focus on the former, which is what people normally want to use. In d, one of our variables has four levels: Frication.
str(d)
## 'data.frame': 4568 obs. of 8 variables:
## $ RTlexdec : num 6.54 6.4 6.3 6.42 6.45 ...
## $ Word : Factor w/ 2197 levels "ace","act","add",..: 467 2124 1838 1321 1302 1347 434 468 15 1632 ...
## $ AgeSubject : Factor w/ 2 levels "old","young": 2 2 2 2 2 2 2 2 2 2 ...
## $ LengthInLetters: int 3 5 6 4 4 4 4 3 3 5 ...
## $ Familiarity : num 2.37 4.43 5.6 3.87 3.93 3.27 3.73 5.67 3.1 4.43 ...
## $ Frication : Factor w/ 4 levels "burst","frication",..: 1 2 2 1 1 1 1 1 3 2 ...
## $ Voice : Factor w/ 2 levels "voiced","voiceless": 1 2 2 2 2 2 1 1 1 2 ...
## $ WordCategory : Factor w/ 2 levels "N","V": 1 1 1 1 1 1 1 1 1 1 ...
You could be interested in possible differences in reaction time across the different levels of Frication. This time, I will assign the ANOVA to a variable (you will normally do that to keep using the output of a test/model in your analysis). Perhaps before running our ANOVA, we might want to look at a plot (which is what you would normally do in a real case).
library(ggplot2)
ggplot(data = d, aes(x = Frication, y = RTlexdec)) +
geom_boxplot()
Clearly, there doesn’t seem to be any difference here. If we now run our ANOVA, we’ll confirm that.
# One-way ANOVA:
myAnova = aov(RTlexdec ~ Frication, data = d)
summary(myAnova)
## Df Sum Sq Mean Sq F value Pr(>F)
## Frication 3 0.01 0.00179 0.073 0.975
## Residuals 4564 112.45 0.02464
You could perhaps add one more factor, namely, Voice. Note, however, that this is a 2-level factor (the only factor with more than two levels in d is Frication).
# Two-way ANOVA:
myAnova2 = aov(RTlexdec ~ Frication + Voice, data = d)
summary(myAnova2)
## Df Sum Sq Mean Sq F value Pr(>F)
## Frication 3 0.01 0.00179 0.073 0.97457
## Voice 1 0.22 0.21722 8.831 0.00298 **
## Residuals 4563 112.23 0.02460
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If you wanted to run a post-hoc test such as Tukey HSD to check the different comparisons in Frication (you wouldn't in this case, of course), you can simply use the TukeyHSD() function in R. You could also plot the differences in means by simply using the plot() function (remember that R will automatically apply plot() to an object, as long as it's plottable, naturally).
TukeyHSD(myAnova2)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = RTlexdec ~ Frication + Voice, data = d)
##
## $Frication
## diff lwr upr p adj
## frication-burst 0.0022773170 -0.01136649 0.01592113 0.9735233
## long-burst 0.0041207202 -0.03986055 0.04810199 0.9950926
## short-burst 0.0008970196 -0.01504220 0.01683624 0.9989216
## long-frication 0.0018434033 -0.04224657 0.04593338 0.9995563
## short-frication -0.0013802973 -0.01761706 0.01485646 0.9963202
## short-long -0.0032237006 -0.04807714 0.04162974 0.9977649
##
## $Voice
## diff lwr upr p adj
## voiceless-voiced -0.009998164 -0.01914064 -0.0008556832 0.0320877
plot(TukeyHSD(myAnova2, "Frication"))
Read more about ANOVAs here. Also, type ?aov() to better understand the function itself.
Like t-tests, running other tests is very easy in R. If you Google x test in R, you will certainly come across numerous websites with examples. Before we start, if you want to check for normality:
shapiro.test(d$RTlexdec)
##
## Shapiro-Wilk normality test
##
## data: d$RTlexdec
## W = 0.98595, p-value < 2.2e-16
qqnorm(d$RTlexdec)
qqline(d$RTlexdec)
Be careful with misconceptions about statistical tests. Read about normality tests here.
For example, if you wanted to run a Wilcoxon test or a Kruskal-Wallis test, you'd type wilcox.test() and kruskal.test(), respectively. Read about them here and here. A comprehensive list of tests and their respective R syntax can be found here.
Simple linear regressions are run using the lm() function in R. You don't need to load any packages for that. If you want to run a hierarchical (mixed-effects) regression, then you'll need to load the lme4 package. Unfortunately, we don't have enough time in this introductory workshop to cover hierarchical models.
Let’s predict reaction time as a function of word familiarity.
lModel1 = lm(RTlexdec ~ Familiarity, data = d)
summary(lModel1)
##
## Call:
## lm(formula = RTlexdec ~ Familiarity, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.49212 -0.11285 -0.00596 0.10569 0.65072
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.780397 0.007178 944.59 <2e-16 ***
## Familiarity -0.060676 0.001810 -33.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1406 on 4566 degrees of freedom
## Multiple R-squared: 0.1975, Adjusted R-squared: 0.1973
## F-statistic: 1124 on 1 and 4566 DF, p-value: < 2.2e-16
# You can calculate confidence intervals with confint():
confint.lm(lModel1)
## 2.5 % 97.5 %
## (Intercept) 6.7663247 6.79446982
## Familiarity -0.0642243 -0.05712723
We can see that Familiarity has a significant effect on RTlexdec: \(\beta =\) -0.0607, \(p < 0.001\). Each unit of increase in Familiarity lowers RTlexdec by 0.0607. We can replot the data to check that:
ggplot(data = d, aes(x = Familiarity, y = RTlexdec)) +
geom_point(alpha = 0.2) +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
The negative effect we find in our model corresponds to the negative trend line in the plot above. Let's add AgeSubject to our model.
lModel2 = lm(RTlexdec ~ Familiarity + AgeSubject, data = d)
summary(lModel2)
##
## Call:
## lm(formula = RTlexdec ~ Familiarity + AgeSubject, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.38126 -0.05907 -0.00418 0.05134 0.53986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.891258 0.004595 1499.82 <2e-16 ***
## Familiarity -0.060676 0.001113 -54.52 <2e-16 ***
## AgeSubjectyoung -0.221721 0.002558 -86.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08643 on 4565 degrees of freedom
## Multiple R-squared: 0.6967, Adjusted R-squared: 0.6966
## F-statistic: 5244 on 2 and 4565 DF, p-value: < 2.2e-16
This is clearly a better model (with a much higher \(R^2\)). Given the clear effect of AgeSubject, we wouldn't want to leave this predictor out. We can see that lModel2 is significantly better than lModel1 by using the anova() function.
anova(lModel2, lModel1)
## # A tibble: 2 × 6
## Res.Df RSS Df `Sum of Sq` F `Pr(>F)`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4565 34.1 NA NA NA NA
## 2 4566 90.2 -1 -56.1 7515. 0
If you want to check the residuals of your model:
plot(residuals(lModel2), col = "gray")
There are several ways to plot a fitted model in R. One nice way is to use the sjPlot package, which is based on ggplot2. As a result, if you're familiar with the aesthetics of ggplot2, sjPlot will look familiar.
library(sjPlot) # Load package
## Install package "strengejacke" from GitHub (`devtools::install_github("strengejacke/strengejacke")`) to load all sj-packages at once!
plot_model(lModel2)
Because the standard errors are very small, and because the estimates are relatively far from each other on the x-axis, you can’t actually see error bars—but they’re there.
You can adjust several parameters in a sjPlot plot. Spend some time playing around with this package: you can use it for linear and logistic regression (including hierarchical/mixed-effects models). Read more here.
To run logistic regressions, you'll use glm() in R. Let's predict age group by reaction time. Recall that logistic regressions provide estimates in log-odds.
glmModel1 = glm(AgeSubject ~ RTlexdec, data = d, family = "binomial")
summary(glmModel1)
##
## Call:
## glm(formula = AgeSubject ~ RTlexdec, family = "binomial", data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3951 -0.5312 0.0235 0.4229 3.6178
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 128.6308 3.8008 33.84 <2e-16 ***
## RTlexdec -19.6497 0.5804 -33.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6332.6 on 4567 degrees of freedom
## Residual deviance: 3145.8 on 4566 degrees of freedom
## AIC: 3149.8
##
## Number of Fisher Scoring iterations: 6
Let's briefly walk through this:
AgeSubject has two levels: old and young
R orders factor levels alphabetically, so old is the reference level
Internally, old is coded as 0 and young as 1
The model therefore predicts the probability of being young
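You can verify this coding yourself (a quick sanity check using base R):
levels(d$AgeSubject) # "old" "young": old comes first, so it's the reference level
contrasts(d$AgeSubject) # Shows that young is coded as 1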
Our model can be expressed as \(log(p/(1-p)) = 128.63 + (-19.64*RT)\). The intercept tells us the log-odds of being in the young group when RTlexdec is zero; since RTlexdec is log-transformed, that corresponds to a raw reaction time of 1 ms (because exp(0) = 1). This is not a meaningful interpretation here given the scale used, so let's ignore the intercept and focus on what you probably want to know.
Let's assume a reaction time of 6 (again: remember that this is log-transformed in the english data set). In that case, the model predicts that the log-odds of being in the young group are \(128.63 + (-19.64 * 6) = 10.79\). Now let's assume a reaction time of 7: \(128.63 + (-19.64 * 7) = -8.85\). The difference between these two predictions is \(-8.85 - (+10.79) = -19.64\), which is exactly the estimate for RTlexdec in the model above. As a result, we can say that -19.64 is the difference in log-odds of p(AgeSubject == young) for a unit increase in the predictor RTlexdec. In terms of odds, a unit increase multiplies the odds of being young by \(exp(-19.64) \approx 3 \times 10^{-9}\); equivalently, the odds are divided by \(exp(19.64) \approx 3.4 \times 10^{8}\).
This means that if your reaction time increases by 1 unit, the probability of being young drops by nearly 100% (since \(3.4 \times 10^{8}/(1 + 3.4 \times 10^{8}) \approx 1\)). This seems quite radical because the unit increase is too large: the range of reaction time values we have is [6.205325, 7.187808]. As a result, by increasing one unit (from 6 to 7) we're visiting the extremes of the range (recall that RTlexdec is on the log scale). Naturally, the interpretation of the results depends on the units in question. Sometimes it will make sense to work with smaller or larger units if you want to report the actual interpretation.
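To sanity-check these numbers in R (coefficients rounded from the model output above, so the results are approximate):
logit6 = 128.63 + (-19.64 * 6) # Log-odds of young at RTlexdec = 6
logit7 = 128.63 + (-19.64 * 7) # Log-odds of young at RTlexdec = 7
plogis(logit6) # Probability of young: essentially 1
plogis(logit7) # Probability of young: essentially 0
exp(logit6 - logit7) # Odds ratio for a 1-unit increase: about 3.4e8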
Just so you can see why this makes sense, let’s replot reaction time by subject age.
ggplot(data = d, aes(x = AgeSubject, y = RTlexdec)) +
geom_boxplot()
You could also backtransform your data. Let’s create a new column that contains the original reaction time values (in ms). Then let’s check the range of RT values we observe in the data.
d$originalRT = exp(d$RTlexdec)
range(d$originalRT)
## [1] 495.38 1323.20
Note that now your units are in ms, which may not be very good for interpreting estimates either, because 1 ms is too small to matter. Let's say you want to know how much the probability of being young changes for every 100 ms. This would be more meaningful.
d$RT100 = d$originalRT / 100
glmModel2 = glm(AgeSubject ~ RT100, data = d, family = "binomial")
summary(glmModel2)
##
## Call:
## glm(formula = AgeSubject ~ RT100, family = "binomial", data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3244 -0.5309 0.0420 0.4471 3.8918
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 19.37245 0.57048 33.96 <2e-16 ***
## RT100 -2.77268 0.08162 -33.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6332.6 on 4567 degrees of freedom
## Residual deviance: 3199.7 on 4566 degrees of freedom
## AIC: 3203.7
##
## Number of Fisher Scoring iterations: 6
By dividing originalRT by 100, each new unit represents 100 ms. Our new estimate for RT is -2.77 (in log-odds). The negative sign shows the direction of the effect: longer RTs lower the probability of being young. In other words, this means that by adding 100 ms to the RT, the odds of being young go down by a factor of exp(2.77) = 15.95. Let's imagine three people: one with a reaction time of 500 ms, another with 600 ms, and another with 1000 ms. Then we can ask what the probabilities of being young are for these people based on our model. Note that the difference between person 1 and person 2 is equivalent to one unit of RT100.
person1 = 19.37 + (-2.77*5) # log-odds for person 1
person2 = 19.37 + (-2.77*6)
person3 = 19.37 + (-2.77*10)
person2-person1 # difference in log-odds (=estimate)
## [1] -2.77
exp(person1) # person 1 in odds
## [1] 249.635
exp(person1)/(1+exp(person1)) # person 1 in probabilities
## [1] 0.9960101
exp(person2)/(1+exp(person2)) # person 2 in probabilities
## [1] 0.9399133
exp(person3)/(1+exp(person3)) # person 3 in probabilities
## [1] 0.0002416569
You don't have to do this manually, of course. Let's imagine you have new speakers with associated RTs. You want to know the changes in probability of being old/young as a function of their RTs. Let's assume that our model is a good model and let's use it. Here's what we are going to do: first, create a data frame with the new reaction times (in units of RT100); then, use predict() to get the model's predictions for each value of RT100; finally, transform the predicted log-odds into probabilities and plot them.
newRTs = data.frame(
RT100 = seq(from = 5, to = 10, by = 0.5) # From 500ms to 1000ms by 50ms
)
preds = predict(glmModel2, newRTs)
probPreds = exp(preds)/(1+exp(preds))
plot(probPreds, type = "b",
ylab = "Probability of being young",
xlab = "Reaction time (1 unit = 100ms)")
Note that above we've transformed log-odds into probabilities manually. You don't need to do that, though. Simply add type = "response" to the predict() function:
newRTs = data.frame(
RT100 = seq(from = 5, to = 10, by = 0.5) # From 500ms to 1000ms by 50ms
)
preds2 = predict(glmModel2, newRTs, type = "response")
plot(preds2, type = "b",
ylab = "Probability of being young",
xlab = "Reaction time (1 unit = 100ms)")
This is a good tutorial on how to interpret logistic regression results.
Even though this tutorial does not cover tidying up your data, that’s normally the very first thing you’ll have to do after you import your file into R. There are several important aspects involved in organizing your data frame, but let’s just focus on one aspect (one which I think is crucial). If you’ve used websites to collect data (e.g., SurveyGizmo), you probably know that the output is messy. You normally have one participant per row, which means you’ll have a bunch of columns storing the same kind of data. So if you have 100 questions of the same type, you won’t have a single column storing 100 values (the long format). Instead, you’ll have 100 columns (i.e., the wide format). Needless to say, this is not what you want.
First, let's see what is meant by a wide data frame. Here's an example (let's call this data frame wide1).
## # A tibble: 3 × 6
## participant origin age response_1 response_2 response_3
## <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 subject1 USA 24 4.03 5.24 10.3
## 2 subject2 Portugal 32 3.93 5.35 9.90
## 3 subject3 Canada 41 4.91 3.68 8.79
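The code that builds wide1 isn't shown above, so here's a minimal sketch that produces an equivalent data frame (values copied from the rounded printout, so they're approximate):
wide1 = data.frame(
participant = factor(paste0("subject", 1:3)),
origin = factor(c("USA", "Portugal", "Canada")),
age = c(24, 32, 41),
response_1 = c(4.03, 3.93, 4.91),
response_2 = c(5.24, 5.35, 3.68),
response_3 = c(10.3, 9.90, 8.79)
)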
You can't possibly model a response variable that spans multiple columns. So let's go from wide to long using the gather() function in the tidyr package.
long1 = wide1 %>%
gather(key = stimulus, value = RT,
response_1:response_3) %>%
arrange(participant)
long1
## # A tibble: 9 × 5
## participant origin age stimulus RT
## <fct> <fct> <dbl> <chr> <dbl>
## 1 subject1 USA 24 response_1 4.03
## 2 subject1 USA 24 response_2 5.24
## 3 subject1 USA 24 response_3 10.3
## 4 subject2 Portugal 32 response_1 3.93
## 5 subject2 Portugal 32 response_2 5.35
## 6 subject2 Portugal 32 response_3 9.90
## 7 subject3 Canada 41 response_1 4.91
## 8 subject3 Canada 41 response_2 3.68
## 9 subject3 Canada 41 response_3 8.79
Inside gather(), you need to specify the key (the name of the new column that will hold the names of the columns you're gathering) and the value, i.e., the column that will hold the actual RT values. Naturally, you could adjust the actual content of the column stimulus (e.g., remove the response_ bits and keep only the number), as sketched below.
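For example, here's one way to do that adjustment, using gsub() and dplyr's mutate():
long1 %>%
mutate(stimulus = gsub("response_", "", stimulus)) # "response_1" becomes "1", etc.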
Now let's go from long back to wide using the function spread().
wide_again = long1 %>%
spread(key = stimulus, value = RT)
wide_again
## # A tibble: 3 × 6
## participant origin age response_1 response_2 response_3
## <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 subject1 USA 24 4.03 5.24 10.3
## 2 subject2 Portugal 32 3.93 5.35 9.90
## 3 subject3 Canada 41 4.91 3.68 8.79
What if you have two sets of multiple columns that you need to gather? For example, suppose each of your questions is accompanied by a certainty scale.
## # A tibble: 3 × 9
## participant origin age response_1 certai…¹ respo…² certa…³ respo…⁴ certa…⁵
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 subject1 USA 24 3.15 3 4.10 2 9.86 5
## 2 subject2 Portugal 32 4.35 4 6.35 5 9.49 4
## 3 subject3 Canada 41 3.24 4 6.80 1 8.59 3
## # … with abbreviated variable names ¹certainty_1, ²response_2, ³certainty_2,
## # ⁴response_3, ⁵certainty_3
This is a trickier case, and there are different ways to do it. I'll do it in three steps. First, I'll ignore all columns that contain certainty, and gather() the response columns, just like what was done above. That will be one data frame. Second, I'll do the same thing, but this time focusing on certainty. Third, I'll take the result of step 1 and add a new column to it. This column (certainty) will come from the result of step 2. I'll also arrange the data frame by participant to make it easier to check whether the output is correct given the original wide data frame.
# Step 1
long2a = wide2 %>%
select(-contains("certainty")) %>%
gather(key = stimulus, value = RT, response_1:response_3, factor_key = TRUE)
# Step 2
long2b = wide2 %>%
select(-contains("response")) %>%
gather(key = stimulus, value = certainty, certainty_1:certainty_3, factor_key = TRUE)
# Step 3
long2 = long2a %>%
mutate(certainty = long2b$certainty) %>%
arrange(participant)
long2
## # A tibble: 9 × 6
## participant origin age stimulus RT certainty
## <fct> <fct> <dbl> <fct> <dbl> <dbl>
## 1 subject1 USA 24 response_1 3.15 3
## 2 subject1 USA 24 response_2 4.10 2
## 3 subject1 USA 24 response_3 9.86 5
## 4 subject2 Portugal 32 response_1 4.35 4
## 5 subject2 Portugal 32 response_2 6.35 5
## 6 subject2 Portugal 32 response_3 9.49 4
## 7 subject3 Canada 41 response_1 3.24 4
## 8 subject3 Canada 41 response_2 6.80 1
## 9 subject3 Canada 41 response_3 8.59 3
So all you need to know are gather() and spread(). (In more recent versions of tidyr, these two functions have been superseded by pivot_longer() and pivot_wider(), but they still work.) Other packages also offer similar tools (though they may be less intuitive). Examples include reshape2 and data.table.
Functions are naturally useful pieces of code for obvious reasons. You may want to perform a given sequence of steps in R, and writing a function that does that is a way to automate the whole process. Functions have a similar structure in different languages, and in R you’ll need curly brackets. Let’s suppose you want to create a function that takes a number and returns that same number minus its square root. Let’s define and apply the function.
myFunction = function(x){
return(x-sqrt(x))
}
myFunction(9)
## [1] 6
myFunction will take your input, the variable x, and perform a given process. Inside the function, you could have any number of things, including additional functions (i.e., nested/embedded functions). You can also personalize your function's output, create error messages etc. For example, in this case your input has to be a number. If it's not, myFunction won't be able to return an output. You can also print a more interesting result, embedding the output in a sentence.
myFunction = function(x){
output = paste("Your result is", x-sqrt(x))
return(output)
}
myFunction(9)
## [1] "Your result is 6"
Now, we use the function paste() to concatenate variables and strings, store the result in a variable, and return that variable. This makes your function more 'user-friendly', so to speak. Any function you use in R has the same essential structure, the only differences being the number of arguments you need to pass to the function (here, only 1) and, obviously, the complexity of the processes inside the function.
IF-statements are an important kind of control structure: they logically evaluate a given input (TRUE or FALSE) and perform some action(s) based on that. You can also add an else component, as well as intermediate steps with else if. Here's an example: assume we have two variables, x and y. In what follows, we first evaluate whether these variables have the same value (they don't). If they don't, then we evaluate whether x is smaller than y (it isn't). Finally, it must be the case that x is greater than y, so we print x is greater than y. Note that the curly brackets' position ensures that the whole expression is interpreted as a single unit (it also helps to keep the statements organized).
x = 10 # Let's define two variables first
y = 5
if(x == y){
print("x and y are the same")
} else if(x < y){
print("x is smaller than y")
} else{
print("x is greater than y")
}
## [1] "x is greater than y"
Let's create a function that contains an IF-statement in it. Let's assume you want a function that tells you whether a number is even or odd. In other words, you give this function a numeric input and it returns one of two sentences, namely, 'your number is even' or 'your number is odd'. To do this, we'll tell the function to divide our input by 2. If the remainder is 0, the number is even; otherwise it's odd. We'll do this by using the modulo operator: %%. Let's call this function myNumber(). I'll define the function first, and then use it with three examples.
myNumber = function(number){
if(number %% 2 == 0){
return('Your number is even')
} else {
return('Your number is odd')
}
}
# Let's use it with 5 now.
myNumber(5)
## [1] "Your number is odd"
# And with 3870
myNumber(3870)
## [1] "Your number is even"
# And with some expression
myNumber(9478 + 37 * 2)
## [1] "Your number is even"
IF-statements are incredibly useful when you need to reshape your data. For example, you may need to create a new column with values that depend on other columns (see the sketch right after this paragraph). One alternative to IF-statements is the ifelse() function, which can also be embedded (so you can nest ifelses in one single ifelse() function). Thus, if you know how to create new columns (we saw this earlier) and you know how to handle conditionals, you're ready to make use of both.
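Here's a minimal sketch with the english data (the cut-off value of 6.55 is arbitrary, chosen just for illustration):
d$speed = ifelse(d$RTlexdec > 6.55, "slow", "fast") # New column conditional on RTlexdec
head(d$speed)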
Finally, let me show you one more example of ifelse().
myNumber2 = function(number){
ifelse(number %% 2 == 0, 'Even', 'Odd') # ifelse() returns its value, so explicit return() calls aren't needed
}
myNumber2(3)
## [1] "Odd"
If you need something simple, ifelse() functions are quite useful. Even if you need one intermediate step, it's still OK. For example, ifelse(x, y, ifelse(z, w, a)) tells you that if x is TRUE, then do y. If x is not TRUE, then evaluate the following: if z is TRUE, then do w; otherwise, do a. But you can see that as your level of embedding gets more and more complicated, ifelse() might not be the best way to go.
Let's say you want to perform a sequence of actions for every row in your data frame. Each action may be easy, but doing it by hand for all 1500 rows in your dataset would be very time-consuming. One way to automate the whole process is to use a loop. This is not the best way to automate actions if you have a LOT of data (loops can be quite slow), but if you only have a relatively small dataset it's a good way to make things work. For now, we won't be doing that kind of loop. Instead, let's just work with a simple example: assume you want to create a sequence of 50 numbers. This can be easily done with seq(50) in R, so you wouldn't want a for-loop for that. But let's say you want the sequence to follow a specific pattern, namely, each number (from 1 to 50) in the sequence should be added to itself squared. So 1 would become 2 (1 + 1²); 2 would become 6 (2 + 2²), and so forth. This is what the following loop does. See comments.
outcome = c() # First, we create an empty vector to store each number in the sequence
for(i in 1:50){
outcome[length(outcome)+1] = i + i^2 # This adds the number to the empty vector
}
outcome # Let's print the vector and see if the numbers are there
## [1] 2 6 12 20 30 42 56 72 90 110 132 156 182 210 240
## [16] 272 306 342 380 420 462 506 552 600 650 702 756 812 870 930
## [31] 992 1056 1122 1190 1260 1332 1406 1482 1560 1640 1722 1806 1892 1980 2070
## [46] 2162 2256 2352 2450 2550
If you're not familiar with coding: everything inside the for-loop will be done to each number in the sequence. So for each number i in the sequence from 1 to 50, we add i + i^2 to outcome. How? Simple: look at the length of outcome and add 1. When it's empty, its length is zero, so the first number will be added to position 1 (0 + 1). The second time the for-loop runs, the length of outcome will be 1, so the second number generated will be added to position 2 (length of outcome + 1). It's easy, practical and very powerful, since you can do very intricate things with loops.
Note that outcome is an empty vector at first (so its length is 0). As a result, we have to take that into account inside the loop. However, it's faster if we tell R up front how many 'slots' we'll be using (the code is also more readable, given that now we have the indices from the start).
outcome = vector("double", length = 50)
# This creates a vector of length 50. double is simply a type of value (a number).
for(i in seq_along(outcome)){
outcome[i] = i + i^2
}
# seq_along(x) is the same as 1:length(x). However, if x is empty, seq_along() will return the correct output (which is empty), whereas 1:length(x) will count back from 1 to 0, returning "1 0" instead.
While-loops are also quite important, and follow the same pattern as for-loops, except that you perform a sequence of actions while a given condition is met. You can read more about loops in general here. The idea here is to introduce you to some basic concepts so you'll be able to navigate the thousands of websites on the topic.
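Here's a minimal while-loop sketch (the threshold of 100 is arbitrary): we keep doubling a number while it is at most 100.
x = 1
while(x <= 100){ # Repeat the body as long as this condition is TRUE
x = x * 2
}
x # 128: the first power of 2 above 100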
You probably remember plotting quadratic equations when you were in high school. You can naturally do that in R as well: there are some pre-defined functions (distributions) that come with R, but you can also create your own function and use it in a plot. We do that by using stat_function(fun = ?) in our ggplot. Simply replace ? with whatever function you'd like. For example, let's say you have the function \(4x^2 - 3x + 9\), and you want to plot it assuming x goes from -5 to +5.
library(ggplot2)
myFunction = function(x){4 * x ^ 2 - 3 * x + 9}
ggplot(data=data.frame(x=seq(-5,5)), aes(x)) +
stat_function(fun=myFunction) +
theme_bw() + xlab('x-axis') + ylab('y-axis')
You don't need to define a function before plotting it. Instead, you could do both things at the same time (though it might look a little messier). Now, let's use a pre-defined function in R: exp. This time, let's have a dashed line.
ggplot(data=data.frame(x=seq(-5,5)), aes(x)) +
stat_function(fun=exp, linetype='dashed') +
theme_bw() + xlab('x-axis') + ylab('y-axis')
It’s often useful to create plots that show distributions, so I’ve created a simple script that does that. You can check it out here. For example, you might want to create a simple plot to show what a normal distribution is and how it changes depending on the standard deviations. Here’s an example.
ggplot(data.frame(j=c(-20,20)), aes(x=j)) +
stat_function(fun=function(x){dnorm(x,mean=0,sd=1)}, aes(color='A')) +
stat_function(fun=function(x){dnorm(x,mean=0,sd=2)}, aes(color='B')) +
stat_function(fun=function(x){dnorm(x,mean=0,sd=3)}, aes(color='C')) +
stat_function(fun=function(x){dnorm(x,mean=0,sd=4)}, aes(color='D')) +
stat_function(fun=function(x){dnorm(x,mean=0,sd=5)}, aes(color='E')) +
theme_bw() + theme(text=element_text(size=10, vjust=1)) +
ggtitle("Normal Distribution") + ylab(NULL) + xlab(NULL) +
scale_color_manual('SD', values = c('black', 'red', 'blue', 'grey', 'brown'), labels=c('1', '2', '3', '4', '5'))
The code looks a bit messy, but I basically added as many stat_function() layers as necessary, so that multiple distributions can be plotted within the same ggplot(). This looks obvious, but being able to stack different plots onto a single 'canvas' is a very important feature of ggplot2.
Nested data frames can be very useful to keep your data organised in a single data frame. The idea is this: imagine if a cell in your data frame is a list instead of a single value. Lists, as you know, are very powerful, so they can hold anything. As a result, a cell could contain an entire data frame, which is “hidden” until you decide to explore it.
Let's go through a simple example with the english data set again. Suppose I want to separate the speakers in the data, and separate the data by Frication. Then, I want to run models that predict RTlexdec based on Familiarity, and then I want to compare the \(R^2\) values for all these models, to spot some trends. This may not be a model you wish to actually work with, but as a step in your analysis it may be very useful, in particular when you have a lot of data to look at.
First, we have to create a column for speakers, since english has no such column. Given the description of english in the languageR documentation, I assume there are two speakers (but it doesn't matter: let's just simulate them). Each unique word in english is duplicated, so we have pairs of words. Let's arrange the data by word, and then add sequences of speaker_1 and speaker_2 to a new column, Speaker.
# I'll reload the packages, just to make it more explicit
library(languageR)
library(dplyr)
library(tidyr) # This will be used to nest data frames
library(broom) # This will be used to explore model structures
library(purrr) # This will be used instead of for-loops, basically
library(ggplot2)
data(english) # Let's reload "english", just in case
# Make english a tibble so it's easier to visualize it
english = as_tibble(english) # as_tibble() replaces the deprecated as_data_frame()
# Add speaker column (make sure you follow what's being done below)
eng = english %>%
arrange(Word) %>%
mutate(Speaker = as.factor(rep(c("speaker_1", "speaker_2"), times = nrow(english)/2)))
Now we have a new column, Speaker. Let's group everything by Speaker and Frication and hide everything else. You'll see how clean the output is.
# Nest data frame
byFrication = eng %>%
group_by(Speaker, Frication) %>%
nest() %>%
arrange(Speaker)
# Now, take a look at what our data look like
byFrication
## # A tibble: 8 × 3
## Frication Speaker data
## <fct> <fct> <list>
## 1 long speaker_1 <tibble [44 × 35]>
## 2 short speaker_1 <tibble [490 × 35]>
## 3 burst speaker_1 <tibble [920 × 35]>
## 4 frication speaker_1 <tibble [830 × 35]>
## 5 long speaker_2 <tibble [44 × 35]>
## 6 short speaker_2 <tibble [490 × 35]>
## 7 burst speaker_2 <tibble [920 × 35]>
## 8 frication speaker_2 <tibble [830 × 35]>
# The column "data" contains several tibbles, which can be accessed the same way you access items in a list.
# Accessing first nested data:
byFrication$data[[1]] # this will be speaker 1, Frication == long
## # A tibble: 44 × 35
## RTlex…¹ RTnam…² Famil…³ Word AgeSu…⁴ WordC…⁵ Writt…⁶ Writt…⁷ Famil…⁸ Deriv…⁹
## <dbl> <dbl> <dbl> <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 6.44 6.12 3.8 ace young N 4.22 -1.25 0.693 0
## 2 6.38 6.14 5.53 age young N 8.40 0.971 5.05 3.97
## 3 6.29 6.14 4.37 aid young V 6.93 1.91 1.61 0.0188
## 4 6.54 6.14 3.27 aide young N 4.62 2.64 0.693 0
## 5 6.62 6.14 2.3 ail young V 4.92 3.64 1.10 0.856
## 6 6.38 6.11 4.1 aim young V 6.60 2.10 1.79 0.327
## 7 6.24 6.11 5.53 air young N 8.41 1.51 4.14 2.33
## 8 6.50 6.23 4.1 aisle young N 4.87 1.28 0.693 0
## 9 6.46 6.12 3.23 ale young N 4.04 0.364 1.10 0.198
## 10 6.37 6.11 3.33 ape young N 4.51 1.44 1.79 0.0421
## # … with 34 more rows, 25 more variables: InflectionalEntropy <dbl>,
## # NumberSimplexSynsets <dbl>, NumberComplexSynsets <dbl>,
## # LengthInLetters <int>, Ncount <int>, MeanBigramFrequency <dbl>,
## # FrequencyInitialDiphone <dbl>, ConspelV <int>, ConspelN <dbl>,
## # ConphonV <int>, ConphonN <dbl>, ConfriendsV <int>, ConfriendsN <dbl>,
## # ConffV <dbl>, ConffN <dbl>, ConfbV <dbl>, ConfbN <dbl>,
## # NounFrequency <int>, VerbFrequency <int>, CV <fct>, Obstruent <fct>, …
Now, let’s run different linear models and extract their \(R^2\).
# Let's create a column that will hold all the models
byFrication = byFrication %>%
mutate(model = map(data, ~ lm(RTlexdec ~ Familiarity, data = .)))
# The map() function applies the lm() function to all items in byFrication$data
# Now, let's extract the R2 and add them to a new column
# Below, we first create a column with info about the model, from which we will derive the R2 values
byFrication = byFrication %>%
mutate(
modelInfo = model %>% map(broom::glance),
R2 = modelInfo %>% map_dbl("r.squared")
)
# If you now take a look at the tibble, you'll see our new columns (note that tibbles can omit certain columns; take a look at the bottom of the output to see which variables are not shown):
byFrication
## # A tibble: 8 × 6
## Frication Speaker data model modelInfo R2
## <fct> <fct> <list> <list> <list> <dbl>
## 1 long speaker_1 <tibble [44 × 35]> <lm> <tibble [1 × 12]> 0.560
## 2 short speaker_1 <tibble [490 × 35]> <lm> <tibble [1 × 12]> 0.462
## 3 burst speaker_1 <tibble [920 × 35]> <lm> <tibble [1 × 12]> 0.444
## 4 frication speaker_1 <tibble [830 × 35]> <lm> <tibble [1 × 12]> 0.431
## 5 long speaker_2 <tibble [44 × 35]> <lm> <tibble [1 × 12]> 0.425
## 6 short speaker_2 <tibble [490 × 35]> <lm> <tibble [1 × 12]> 0.347
## 7 burst speaker_2 <tibble [920 × 35]> <lm> <tibble [1 × 12]> 0.362
## 8 frication speaker_2 <tibble [830 × 35]> <lm> <tibble [1 × 12]> 0.344
# Finally, we can plot the R2 and see that for both speakers, the best models (re. R2) are the ones where Frication == long. Also, speaker 1 has a higher R2 across all types of Frication.
byFrication %>% ggplot(aes(x = R2, y = reorder(Frication, R2))) +
geom_point(aes(color = Speaker), size = 4) +
labs(y = "Frication", x = expression(R^2))
# Note that I have reordered the levels of Frication based on R2 values *in the plot*.
Read more about nested data frames here.
Sometimes, you have multiple files (data frames) to analyse. For example, you may have a spreadsheet with information about your stimuli and another spreadsheet with the actual output of the experiment. You normally want to extract the results from the second data frame and paste them into your first data frame. However, sometimes the row order is different, so you not only need to copy and paste, you also have to match the rows correctly. In Excel you would use LOOKUP() or VLOOKUP(), for example.
As an example, let's simulate two data frames: df1 and df2. Let's say you want to add the origin and the age of each speaker to df1, but these two columns are in data frame df2. We'll first use match(), and then do the same thing using dplyr. Let's look at each data frame.
library(dplyr)
set.seed(12)
df1 = data.frame(
speaker = rep(paste("speaker_", seq(1:3), sep = ""), each = 3),
RT = rnorm(9, mean = 0.5, sd = 0.05),
verb = rep(c("go", "travel", "make"), 3),
response = sample(c("correct", "incorrect"), size = 9, replace = TRUE)
)
df1
## # A tibble: 9 × 4
## speaker RT verb response
## <chr> <dbl> <chr> <chr>
## 1 speaker_1 0.426 go incorrect
## 2 speaker_1 0.579 travel incorrect
## 3 speaker_1 0.452 make correct
## 4 speaker_2 0.454 go incorrect
## 5 speaker_2 0.400 travel correct
## 6 speaker_2 0.486 make incorrect
## 7 speaker_3 0.484 go incorrect
## 8 speaker_3 0.469 travel incorrect
## 9 speaker_3 0.495 make correct
df2 = data.frame(
speaker = rep(paste("speaker_", seq(1:3), sep = ""), each = 1),
origin = rep(c("Canada", "US", "Italy"), each = 1),
age = rep(c(25, 37, 18), each = 1)
)
df2
## # A tibble: 3 × 3
## speaker origin age
## <chr> <chr> <dbl>
## 1 speaker_1 Canada 25
## 2 speaker_2 US 37
## 3 speaker_3 Italy 18
Naturally, these two data frames are equally arranged, which quite often is not the case. To make things more realistic, let's order df1 by RT, so that the data frames are not ordered the same way.
df1 = df1 %>% arrange(RT)
copy_df1 = df1 # Create copy of df1 (see below)
df1
## # A tibble: 9 × 4
## speaker RT verb response
## <chr> <dbl> <chr> <chr>
## 1 speaker_2 0.400 travel correct
## 2 speaker_1 0.426 go incorrect
## 3 speaker_1 0.452 make correct
## 4 speaker_2 0.454 go incorrect
## 5 speaker_3 0.469 travel incorrect
## 6 speaker_3 0.484 go incorrect
## 7 speaker_2 0.486 make incorrect
## 8 speaker_3 0.495 make correct
## 9 speaker_1 0.579 travel incorrect
Now, to add origin and age to df1, you can use the match() function:
# Adding the columns in df2 to df1 using match():
df1$origin = df2$origin[match(df1$speaker, df2$speaker)]
df1$age = df2$age[match(df1$speaker, df2$speaker)]
df1
## # A tibble: 9 × 6
## speaker RT verb response origin age
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 speaker_2 0.400 travel correct US 37
## 2 speaker_1 0.426 go incorrect Canada 25
## 3 speaker_1 0.452 make correct Canada 25
## 4 speaker_2 0.454 go incorrect US 37
## 5 speaker_3 0.469 travel incorrect Italy 18
## 6 speaker_3 0.484 go incorrect Italy 18
## 7 speaker_2 0.486 make incorrect US 37
## 8 speaker_3 0.495 make correct Italy 18
## 9 speaker_1 0.579 travel incorrect Canada 25
Now using dplyr: simply call left_join(). Let’s add the columns in df2 to copy_df1.
new_df1 = copy_df1 %>% left_join(df2, by = "speaker")
new_df1
## # A tibble: 9 × 6
## speaker RT verb response origin age
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 speaker_2 0.400 travel correct US 37
## 2 speaker_1 0.426 go incorrect Canada 25
## 3 speaker_1 0.452 make correct Canada 25
## 4 speaker_2 0.454 go incorrect US 37
## 5 speaker_3 0.469 travel incorrect Italy 18
## 6 speaker_3 0.484 go incorrect Italy 18
## 7 speaker_2 0.486 make incorrect US 37
## 8 speaker_3 0.495 make correct Italy 18
## 9 speaker_1 0.579 travel incorrect Canada 25
Clearly, the dplyr option is easier to read. You can also use the data.table package, which offers a very flexible syntax.
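For completeness, here’s a minimal sketch of the same join using data.table (not part of the original tutorial; the on argument names the key column):
library(data.table)
dt1 = as.data.table(copy_df1)
dt2 = as.data.table(df2)
dt2[dt1, on = "speaker"] # X[Y] is a join in data.table: adds dt2's columns to dt1's rows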
Manipulating strings can be very useful when we analyse linguistic data. Naturally, there are many tools for text mining (see resources below), which can be great for creating corpora from text, for example. In fact, the tm package offers several interesting functions that go beyond text mining, and which can certainly help you ‘clean’ your strings (e.g., getting rid of function words). Here, I will focus on string manipulation, which is perhaps more commonly used in linguistic data analysis among experimental/theoretical linguists.
In particular, I will focus on three things: fundamental functions for exploring strings, bigrams, and regular expressions for pattern recognition.
Let’s get started with two objects (variables): a sentence and a word. They share the same class (character).
sentence = "Capybaras are very friendly"
word = "capybara"
(Image caption: capybaras don’t normally look stressed.)
Let’s first focus on the variable sentence and check some basic functions.
toupper(sentence)
## [1] "CAPYBARAS ARE VERY FRIENDLY"
tolower(sentence)
## [1] "capybaras are very friendly"
Because most of the time you want to standardise your words, tolower() may be quite useful.
Note that sentence contains a single string, which is a sentence (so you can’t easily access the individual words). As a result, you will often want to extract each word from Capybaras are very friendly.
sentenceWords = strsplit(tolower(sentence), split = " ")[[1]]
sentenceWords
## [1] "capybaras" "are" "very" "friendly"
The function strsplit() above has a very important argument (split), which tells R what delimits the elements we want to extract. Here, because we want words, we simply tell R to split on a space, " ". The second important thing here is [[1]]. Here’s what it does: strsplit() outputs a list, which in this case has only one entry; inside that entry we find the individual words. If you don’t use [[1]] above, you’ll need to type sentenceWords[[1]][1] to access the first word in the sentence. By including [[1]] above, you can simply type sentenceWords[1], because sentenceWords is already the character vector stored in that first entry.
If you’re familiar with Python, you will notice that accessing the last item in a vector cannot be achieved by using -1 (this will in fact remove item 1; see below). A second difference is that we start counting from 1 in R.
sentenceWords[2] # Second word in the sentence
## [1] "are"
sentenceWords[length(sentenceWords)] # Last word
## [1] "friendly"
sentenceWords[-1] # This will remove the first item
## [1] "are" "very" "friendly"
sentenceWords[c(4,1)] # This will pick items 4 and 1
## [1] "friendly" "capybaras"
R offers some functions that concatenate variables and strings (e.g., cat(), paste()). For example:
paste("The variable 'sentenceWords' has", length(sentenceWords), "words")
## [1] "The variable 'sentenceWords' has 4 words"
paste("Some", sentenceWords[1], "make great pets.")
## [1] "Some capybaras make great pets."
paste("a", "b", sep = "-") # You can also add a separator
## [1] "a-b"
What if we want to know the length of each word in a given vector?
library(stringr)
for(i in 1:length(sentenceWords)){
print(
paste(
sentenceWords[i], ": ", str_length(sentenceWords[i]), "letters"
)
)
}
## [1] "capybaras : 9 letters"
## [1] "are : 3 letters"
## [1] "very : 4 letters"
## [1] "friendly : 8 letters"
# Alternatively (and more simply):
str_length(sentenceWords)
## [1] 9 3 4 8
I used str_length() from the stringr package above, but you could also use nchar(), which requires no packages. Read more about stringr here.
What if you want to add a word to sentenceWords? In fact, what if you want to add any item to a vector…?
# Let's add another word to our sentence (vector)
sentenceWords[length(sentenceWords)+1] = "animals"
sentenceWords
## [1] "capybaras" "are" "very" "friendly" "animals"
How about our variable word, defined above? We could also split it into characters:
wordLetters = strsplit(word, split = "")[[1]]
wordLetters
## [1] "c" "a" "p" "y" "b" "a" "r" "a"
You simply need to tell R to split by… nothing: "". Also, note that R actually knows the letters in the alphabet:
letters # All letters
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
sample(letters, 5) # Let's randomly sample 5 letters
## [1] "h" "k" "n" "g" "l"
n-grams can be very useful. As an example, let’s see how you could create your own function that returns the bigrams of a given word (this could naturally be extended to bigrams of a given sentence).
wdBigram = function(x){
temp = strsplit(x, split = "")[[1]]
lenTemp = length(temp)
bigrams = c()
# Word-initial bigram
first = paste("#", temp[1], sep = "")
bigrams[length(bigrams)+1] = first
# Word-internal bigrams
for(i in 1:(lenTemp-1)){
bigrams[length(bigrams)+1] = paste(temp[i], temp[i+1], sep = "")
}
# Word-final bigram
last = paste(temp[length(temp)], "#", sep = "")
bigrams[length(bigrams)+1] = last
return(bigrams)
}
Now let’s test it with our variable word.
wdBigram(word)
## [1] "#c" "ca" "ap" "py" "yb" "ba" "ar" "ra" "a#"
Let’s go back to the english data set and analyse our bigrams there.
library(languageR)
data(english)
d = english
head(d$Word) # These are the words used in the experiment
## [1] doe whore stress pork plug prop
## 2197 Levels: ace act add age aid aide ail aim air aisle ale angst ant ape ... zoo
words = d$Word
Now that we have assigned all the words in the column Word of the english data frame to a new variable (words), we can collect the bigrams in each word and see which bigrams are more common.
# Step 1: create an empty list
allBigrams = list()
# Step 2: create a for-loop to apply wdBigram to all words
for(word in words){
allBigrams[[length(allBigrams)+1]] = wdBigram(word)
}
# Step 3: unlist allBigrams
allBigrams = unlist(allBigrams)
head(allBigrams, 10)
## [1] "#d" "do" "oe" "e#" "#w" "wh" "ho" "or" "re" "e#"
Now, let’s take a look at the number of tokens per type:
allBigrams = as.data.frame(allBigrams) # Create a data frame
names(allBigrams) = "bigram" # Let's name the only column we have
count(group_by(allBigrams, bigram)) # These are all the bigrams we have:
## # A tibble: 361 × 2
## bigram n
## <chr> <int>
## 1 #a 48
## 2 #b 388
## 3 #c 458
## 4 #d 204
## 5 #e 34
## 6 #f 292
## 7 #g 274
## 8 #h 216
## 9 #i 14
## 10 #j 62
## # … with 351 more rows
# Because there's only one variable in allBigrams, we could simplify the count:
count(allBigrams, bigram)
## # A tibble: 361 × 2
## bigram n
## <chr> <int>
## 1 #a 48
## 2 #b 388
## 3 #c 458
## 4 #d 204
## 5 #e 34
## 6 #f 292
## 7 #g 274
## 8 #h 216
## 9 #i 14
## 10 #j 62
## # … with 351 more rows
This looks promising, but it’s still kind of messy.
# First, let's create another data frame with the output from count()
results = count(group_by(allBigrams, bigram))
# Now let's arrange it by the frequency of each bigram
results = arrange(results, desc(n))
# Finally, let's check the 10 most and least frequent bigrams
head(results, 10)
## # A tibble: 10 × 2
## bigram n
## <chr> <int>
## 1 e# 1096
## 2 #s 830
## 3 t# 654
## 4 #c 458
## 5 #b 388
## 6 k# 366
## 7 #p 344
## 8 h# 344
## 9 d# 318
## 10 l# 306
tail(results, 10)
## # A tibble: 10 × 2
## bigram n
## <chr> <int>
## 1 ws 2
## 2 wt 2
## 3 xe 2
## 4 xt 2
## 5 yi 2
## 6 yl 2
## 7 yn 2
## 8 yr 2
## 9 yt 2
## 10 za 2
# In fact, let's create a variable for the 15 most frequent bigrams:
popularBigrams = results[1:15,]
head(popularBigrams)
## # A tibble: 6 × 2
## bigram n
## <chr> <int>
## 1 e# 1096
## 2 #s 830
## 3 t# 654
## 4 #c 458
## 5 #b 388
## 6 k# 366
Finally, you could also calculate the bigram probability of each word by extending the function above (on my website you can find a function that does that).
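As a sketch of one way to do this (not the author’s exact function), you could score a word by summing the log relative frequencies of its bigrams, using the counts stored in results above. Unseen bigrams yield NA here; a real implementation would want smoothing.
bigramProb = function(x, counts = results){
  bs = wdBigram(x) # Bigrams of the word
  probs = counts$n[match(bs, counts$bigram)] / sum(counts$n) # Relative frequency of each bigram
  sum(log(probs)) # Log probability; NA if any bigram is unseen
}
bigramProb("pork")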
An entire workshop could be devoted to regular expressions. This is an excellent introduction, which will show you the most fundamental aspects of regex in R. In what follows, I will simply show you a quick example.
What if you wanted to examine only word-internal bigrams? In other words, you want to exclude all bigrams that contain #. This is an easy pattern to match using regular expressions.
wdInternal = results[!grepl("#", results$bigram), ]
head(wdInternal)
## # A tibble: 6 × 2
## bigram n
## <chr> <int>
## 1 st 304
## 2 ch 286
## 3 ea 260
## 4 in 232
## 5 re 226
## 6 ar 224
There is a family of functions in R dedicated to regular expressions. In the example above, I use grepl(), which returns TRUE or FALSE for each element of a vector, depending on whether the pattern is matched (its sibling grep() returns the indices of the matches instead). Because I don’t want rows containing the pattern, I negate the result with !, so only rows without # are returned.
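Here is a quick illustration of the main members of that family (all in base R):
x = c("e#", "st", "#c")
grepl("#", x) # TRUE FALSE TRUE: does each element match?
grep("#", x) # 1 3: indices of the matching elements
sub("#", "", x) # Replace the first match in each element
gsub("#", "_", x) # Replace all matches in each element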
But what if you want word-internal bigrams that only contain consonants? Perhaps you are interested in consonant clusters, for example. Note that we are working with orthography here.
notCons = c("a", "e", "i", "o", "u", "y", "w")
clusters = wdInternal[!grepl(paste(notCons, collapse = "|"), wdInternal$bigram), ]
head(clusters)
## # A tibble: 6 × 2
## bigram n
## <chr> <int>
## 1 st 304
## 2 ch 286
## 3 sh 194
## 4 ck 142
## 5 ll 130
## 6 th 128
First, we created a variable to hold all values we do not wish to have in our bigrams. Then, we used paste() to combine all these segments with |, which in regex means or (recall that this symbol is also used in logical conditions when subsetting). As a result, grepl() will look for a OR e OR i etc. If any of these segments is found, the output is TRUE. Because we used !, every case that evaluates to TRUE is ignored (i.e., removed). What remains is a data frame with consonant-only bigrams (also note that we’re using wdInternal as our data frame, so word-initial and word-final bigrams are not a problem anymore).
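The same filter can be written more compactly with a regex character class instead of paste(..., collapse = "|"):
clusters2 = wdInternal[!grepl("[aeiouyw]", wdInternal$bigram), ]
head(clusters2) # Same rows as clusters above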
Working with dates may be important (e.g., time series analysis). If you have longitudinal data, it might be useful to see trends as a function of time, for example. But using dates as variables may be tricky. Here, I’ll use as an example the registration form for this workshop: I simply downloaded the form spreadsheet as a csv file and opened it with R.
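The import step itself is not shown here; a minimal sketch, assuming a hypothetical file name:
ws = read.csv("registrations.csv", stringsAsFactors = FALSE) # Hypothetical file name; adjust to your download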
First, let’s look at what the csv file looks like (I have assigned it to the variable ws). These are the first six rows in the data frame. Note the Timestamp column.
head(ws[,c("Timestamp", "Institution", "Field")])
## # A tibble: 6 × 3
## Timestamp Institution Field
## <chr> <chr> <chr>
## 1 8/19/2016 1:24:41 McGill Linguistics
## 2 8/19/2016 11:02:45 McGill Communication Sciences and Disorders
## 3 8/19/2016 14:00:01 McGill Communication Sciences and Disorders
## 4 8/19/2016 21:58:30 McGill Psychology
## 5 8/20/2016 7:45:15 McGill Linguistics
## 6 8/22/2016 9:39:28 McGill Communication Sciences and Disorders
Here’s what we want to do: we want a plot that shows the number of entries in the form by date. Below, I use a long pipeline to do the following:
1. select the columns of interest (Timestamp, Institution, Field)
2. create a Date column by stripping the time of day from Timestamp
3. drop the original Timestamp column
4. convert Date to class Date
5. count the registrations per date and plot them
Step 4 is important: when a given object is of class Date, R knows how to interpret it correctly (i.e., as a date, and not, e.g., as a character or as a factor). In the code below, you will see that I’m using scale_x_date() as a ggplot layer. Within that layer, date_breaks tells R the intervals I wish to use on the x-axis, whereas date_labels tells R how to actually display the date. For example, if I’d used %A, the x-axis would contain actual day names (Sunday, Monday etc.). Read more about dates in R here.
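To see how these format codes work, here’s a quick illustrative example with a single Date object (the same codes are used by scale_x_date()):
oneDate = as.Date("2016-08-19")
format(oneDate, "%A") # "Friday" (full day name, in an English locale)
format(oneDate, "%b, %d") # "Aug, 19" (abbreviated month, day)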
library(ggplot2)
library(stringr)
library(dplyr)
ws %>% select(Timestamp, Institution, Field) %>%
mutate(Date = str_replace(Timestamp, " .*", "")) %>%
select(-Timestamp) %>%
mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
group_by(Date) %>% summarise(n = n()) %>%
ggplot(aes(x=Date, y=n)) +
geom_line(aes(group=1)) +
geom_point(size=4) +
scale_x_date(date_breaks = "3 days", date_labels = "%b, %d") +
scale_y_continuous(breaks = 1:6) +
labs(x = "Date", y = "Number of registrations", title = "R workshop") +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5))
Note that the resulting data frame has a different date format, namely yyyy-mm-dd. Here’s what ws looks like before we plot it.
ws %>% select(Timestamp, Institution, Field) %>%
mutate(Date = str_replace(Timestamp, " .*", "")) %>%
select(-Timestamp) %>%
mutate(Date = as.Date(Date, format = "%m/%d/%Y"))
## # A tibble: 33 × 3
## Institution Field Date
## <chr> <chr> <date>
## 1 McGill Linguistics 2016-08-19
## 2 McGill Communication Sciences and Disorders 2016-08-19
## 3 McGill Communication Sciences and Disorders 2016-08-19
## 4 McGill Psychology 2016-08-19
## 5 McGill Linguistics 2016-08-20
## 6 McGill Communication Sciences and Disorders 2016-08-22
## 7 McGill Communication Sciences and Disorders 2016-08-22
## 8 Concordia Education 2016-08-22
## 9 Concordia Linguistics 2016-08-23
## 10 Concordia Education 2016-08-24
## # … with 23 more rows
Several html widgets make it possible to combine R, html and Javascript. One nice example is the networkD3 package, which allows R to make use of the D3 library. The bottom line is: R can be linked to several other nice tools for data visualization, which may or may not be helpful depending on what data you want to analyse/visualize and how you plan to present your plots. If you like dynamic and interactive (reactive) plots, you’ll be limited to online platforms (e.g., your website).
Here’s a very simple example: a network of the genealogy of Romance languages. For a more elaborate example, click here.
library(networkD3)
library(extrafont)
src = c("LATIN", "LATIN",
"Vulgar Latin", "Vulgar Latin",
"Continental Romance", "Continental Romance", "Continental Romance",
"Italo-Western", "Italo-Western",
"Eastern Romance", "Eastern Romance",
"Western Romance", "Western Romance",
"Proto-Italian",
"Balkan Romance",
"Ibero-Romance", "Ibero-Romance",
"Gallo-Romance", "Gallo-Romance",
"Proto-Romanian", "Proto-Romanian",
"Occitano-Romance", "Occitano-Romance")
target = c("Classical Latin", "Vulgar Latin",
"Continental Romance", "Sardinian Language",
"Italo-Western", "African Romance", "Eastern Romance",
"Western Romance", "Proto-Italian",
"Balkan Romance", "Dalmatian",
"Ibero-Romance", "Gallo-Romance",
"Italian",
"Proto-Romanian",
"Portuguese", "Spanish",
"Occitano-Romance", "French",
"Romanian", "Aromanian",
"Catalan", "Occitan")
romance = data.frame(src, target, stringsAsFactors = FALSE)
simpleNetwork(romance, opacity = 0.8,
zoom = FALSE,
linkDistance = 45,
nodeColour = c("#A5271D"),
fontSize = 15,
charge = -30,
fontFamily = "CMU Sans Serif",
width = "100%"
)
As usual, there are many different ways to plot ordinal data. One common way is to use stacked bars. As an example, let’s first create our data. Assume we have 10 participants with 10 observations each. They each had to rate the naturalness of a given structure on a 5-point scale. Finally, let’s say these 10 speakers belong to two different groups (for some reason).
First, let’s create our data frame, called df here.
library(ggplot2)
library(dplyr)
library(RColorBrewer)
# STEP 1: create data
# gl() is used for creating factor levels
df = data.frame(
speaker = gl(n = 10, k = 10, labels = paste("speaker", seq(1:10), sep = " ")),
group = gl(n = 2, k = 50, labels = c("Group 1", "Group 2")),
response = as.ordered(sample(seq(1,5), 100, replace = TRUE))
)
# STEP 2: Plot responses
ggplot(data = df, aes(x = speaker, fill = response)) +
geom_bar(stat = "count") +
facet_grid(~group, scales = "free_x") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
labs(y = "", x = "") +
scale_fill_brewer("Response", palette = "OrRd")
The ggplot call above has some important details, e.g., removing the y-axis ticks and labels (you may want to keep them, though). The final line tells R to use a particular colour palette ("OrRd"), which goes from orange to red. To see what other palettes exist, type RColorBrewer::display.brewer.all(). Naturally, you may need to install the RColorBrewer package if it’s not installed yet.
A note on step 1 above: I generated responses using the sample() function, which randomly draws a sample (in this case from the sequence that goes from 1 to 5).
Finally, two things you may find useful.
First: If you don’t want to remove the y-axis labels and ticks, just delete the two respective lines of code from the ggplot call. You can also add a final layer to specify which breaks you want to use along that axis. In fact, you may want percentages by speaker instead of counts (here counts are easy to read because there are exactly 10 items per speaker). My suggestion is to use dplyr to prepare the data before plotting it if that’s what you want. The reason is that ggplot will calculate percentages over all facets by default, and fixing that is unnecessarily complex. Preparing your data before plotting is a better idea here. Like this:
df2 = df %>% group_by(speaker, response) %>%
summarise(n = n()) %>%
mutate(prop = n/sum(n))
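You could then plot the proportions with geom_col() (a sketch; note that df2 as computed above no longer carries the group column, so there is no faceting here):
ggplot(df2, aes(x = speaker, y = prop, fill = response)) +
  geom_col() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(y = "Proportion", x = "") +
  scale_fill_brewer("Response", palette = "OrRd")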
Second: The legend in the plot above goes from 1 to 5 (top down), whereas the plot goes the opposite way. This may or may not bother you. I personally prefer to make both legend and plot go the same direction. To do that, you can pass another argument to the scale_fill_brewer() layer. Below, I add such an argument (breaks), and tell it to use as legend breaks the inverse order of the levels in response. Simply put, rev(levels(df$response)) is equivalent to c("5", "4", "3", "2", "1").
This solution is better than releveling the factor (remember this is an ordered factor). In sum, you just want to reorder it in the plot, not in the data.
PS: I’ve also added another argument to scale_fill_brewer(): direction. If you use 1 here, you get the “higher = darker” colour pattern; if you use -1, the higher end of the scale will be lighter. 1 makes more sense to me here. To get more info on function arguments, always check the documentation by typing ? before a function name (or using help()).
ggplot(data = df, aes(x = speaker, fill = response)) +
geom_bar(stat = "count") +
facet_grid(~group, scales = "free_x") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(y = "", x = "") +
scale_fill_brewer("Response", palette = "OrRd", direction = 1,
breaks = rev(levels(df$response))) +
scale_y_continuous(breaks = seq(from = 0, to = 11, by = 2)) # y-axis breaks
Sometimes you may want to know how much memory R objects use (mainly if you’re dealing with very large files at once). For example, you might want to know how large a given data frame is so you can anticipate the size of the file you will generate.
In Linguistics, we normally do not work with big data, so our files tend to be very small compared to genetics, for instance. A csv file with more than 100Mb is relatively rare; text files with 1Gb would be quite unusual in a typical linguistics experiment (remember that RData is way better than csv in this respect as well).
To know how large an object is, there are two useful functions: object.size(), which comes with base R, and mem_used(), from the pryr package. The former tells you how large any object is: a data frame, a vector, a function, etc. The latter tells you the memory being used by all the objects in your workspace. (pryr also provides its own object_size(), with an underscore, which accounts for shared memory.)
On top of that, you can use the format() function to define the unit of the output. I usually want the output in megabytes, so:
# Let's create an object with 100,000 random numbers
testObject = rnorm(100000)
# Load package
library(pryr)
##
## Attaching package: 'pryr'
## The following objects are masked from 'package:purrr':
##
## compose, partial
object.size(testObject)
## 800048 bytes
format(object.size(testObject), units = "Mb")
## [1] "0.8 Mb"
# Let's check the size of some objects created during this tutorial:
object.size(clusters)
## 20096 bytes
object.size(d)
## 1483368 bytes
object.size(popularBigrams)
## 5816 bytes
# These are all the objects in our workspace:
objects()
## [1] "allBigrams" "badWords" "byFrication" "clusters"
## [5] "columns" "copy_df1" "d" "d2"
## [9] "dAgg" "dClean" "df" "df_NAs"
## [13] "df1" "df2" "dSum" "eng"
## [17] "english" "fricMean" "glmModel1" "glmModel2"
## [21] "group1" "group2" "i" "lModel1"
## [25] "lModel2" "long1" "long2" "long2a"
## [29] "long2b" "longData" "myAnova" "myAnova2"
## [33] "myFunction" "myNumber" "myNumber2" "myNumbers"
## [37] "new_df1" "newData" "newRTs" "notCons"
## [41] "noWar" "orderedRT" "orderedRT2" "outcome"
## [45] "person1" "person2" "person3" "popularBigrams"
## [49] "preds" "preds2" "probPreds" "propData"
## [53] "results" "romance" "sentence" "sentenceWords"
## [57] "someData" "src" "target" "testObject"
## [61] "var1" "var2" "var3" "wdBigram"
## [65] "wdInternal" "wide_again" "wide1" "wide2"
## [69] "word" "wordLetters" "words" "ws"
## [73] "x" "y" "young" "youngV"
## [77] "z"
Finally, to check how much space is being used by all these objects:
mem_used()
## 229 MB
I have included several useful links throughout the tutorial above. Some of them are repeated in this section (e.g., on ggplot2 in R), and some are particularly helpful.
Make sure to check the following packages/tutorials if you are interested in text/string manipulation. This may be very useful for extracting word/sentence/phonotactic patterns from corpora, for example. The tm package is specifically designed for text mining (hence the name).