NOTE: The tutorial below is a little outdated. For a more comprehensive and up-to-date intro to R using the tidyverse family of packages, see my book.
This tutorial provides an introduction to the R language. It assumes no prior background in R or data analysis. Each of our four meetings covers a fundamental aspect of the language. The main objective is to show how you can read your own data, explore patterns of interest and visualize trends. We will also briefly discuss statistical tests and the syntax of regression models. Current version created: April 2019; Last updated: September 2021
Before we start, you might want to take a look at this comparison of different packages for statistical analysis. This article discusses some trends in the data analysis world, and shows that SPSS and SAS have been declining rapidly in recent years.
You can find a complete R programming course in French here.
If you are just starting to learn R, it might be useful to weigh the language’s pros and cons first.
Important: This tutorial summarises the main aspects of what we discuss in class. As a result, this vignette will focus mostly on actual code, not text.
I will add links to other websites or manuals that may be helpful. For starters, here you can find a comprehensive tutorial on R.
Most of the time, doing data analysis involves the four steps below. Here, we will first focus on the basics (Part 1), and then on steps 2, 3 and 4. Extras are also provided at the end of this page.
Topics: variables, matrices, data frames, slice notation, RData, packages, R scripts
Usually, the first thing you want to do when you start an R session is (i) to make sure that your workspace is clean and (ii) to specify your working directory.
ls() # This will show you the objects you have in your workspace
rm(list=ls()) # This will clean the workspace
getwd() # This will tell you what your working directory is
setwd("~/Desktop") # This will set it to my own desktop
Unless you are doing something very simple, you will want to open a new R script (cmd+N on a Mac), which you can then save and re-run in the future. Your entire analysis can be saved in a single R script, which other people can run (reproduce) in the future. Read more about R sessions here.
First and foremost, R can be used as a (powerful) calculator. Below, you can see some examples.
seq(100) # This will print numbers from 1 to 100
seq(from = 1, to = 100, by = 1) # The same thing, but with explicit arguments
3 + 4
5 - 2
10 / 3
2 ** 3 # or 2 ^ 3
10 %% 3 # Modulo: the remainder of 10 / 3 (i.e., 1)
sqrt(10)
log10(100)
3**2 - (3 - 4 / (3 + 1)) * sqrt(148 %% 32)
R can also evaluate TRUE/FALSE statements:
5 > 3
## [1] TRUE
nchar("linguistics") < 10 # Is the number of characters in "linguistics" smaller than 10?
## [1] FALSE
10 == 10 # Is 10 the same as 10? Note that == is different from =
## [1] TRUE
In R, = is used to assign values to variables, just like <- (see below). If you want to state that something is equal to something else, you use == (cf. all.equal() and identical()).
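As a quick toy illustration (my own examples) of why these alternatives exist:
0.1 + 0.2 == 0.3            # FALSE, due to floating-point representation
## [1] FALSE
all.equal(0.1 + 0.2, 0.3)   # TRUE: equal within a small numeric tolerance
## [1] TRUE
identical(c(1, 2), c(1, 2)) # TRUE: exactly the same object
## [1] TRUE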
You can assign values to variables. Your variable name can be (almost) anything. These are three possible ways of assigning a number to a variable called myVariable.
myVariable = 5
myVariable <- 5
5 -> myVariable
If you run myVariable, you’ll get 5. Your workspace now contains one item:
myVariable
## [1] 5
ls()
## [1] "myVariable"
To create a matrix in R, you use the command matrix(). To get help with this command, simply type ?matrix().
myMatrix = matrix(c(1,2,3,4,5,6), nrow=2, ncol=3, byrow = TRUE)
myMatrix
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
In the matrix above, we first tell R which elements we want to add to our matrix, followed by the number of rows and columns and, finally, whether or not we wish to order the elements by row.
You can also assign multiple values to a variable. To do that, you first need to concatenate (or combine) all the values you want to assign using c(). This will create a vector. If you then want to perform an operation over the entire vector, you can treat the vector as a ‘single unit’. Vectorized operations are quite fast.
Vectors are essential. When you load your data into R (e.g., a spreadsheet), most of the time your columns can be treated as vectors.
x = c(1,2,3,4,5,6,7) # x is now a vector
x ** x # This will raise each item in x to its own power (e.g., ..., 7^7)
## [1] 1 4 27 256 3125 46656 823543
Variables can contain other types of elements. For example, you may want to have a variable with different types of word order:
wordOrder = c("SOV", "SVO", "VOS", "others")
In this case, wordOrder contains strings (as opposed to numbers, for example). This vector’s class is therefore character.
class(wordOrder)
## [1] "character"
You can access different elements in a vector using slice notation: vector[n], where n represents the position of the item you’re looking for.
wordOrder[2]
## [1] "SVO"
In addition, you can assign names to your vector elements.
names(wordOrder) # No names yet.
## NULL
names(wordOrder) = c("item1", "item2", "item3", "item4")
names(wordOrder) # New names
## [1] "item1" "item2" "item3" "item4"
wordOrder # Note that now you have items and names
## item1 item2 item3 item4
## "SOV" "SVO" "VOS" "others"
Because you now have names (or labels) for each item in the vector, you can also call an item by its name instead of its position. This is useful if you have several items in your vector.
wordOrder["item2"]
## item2
## "SVO"
You can also have a vector of class logical (TRUEs and FALSEs). However, you cannot have a vector with different types of elements. A vector will coerce its values so that all elements belong to the same class (if possible). For example, if your vector contains numbers and strings, R will coerce the numbers into strings, so that the entire vector belongs to the class character.
mixedVector = c("SOV", 2, "VOS", "others", 7)
mixedVector
## [1] "SOV" "2" "VOS" "others" "7"
class(mixedVector)
## [1] "character"
You can see above that the numbers 2 and 7 are printed as strings (“2” and “7”), and that mixedVector is, in fact, not mixed at all: it’s a vector that contains only strings. If you do want to have different types of elements in your vector, you’ll need a list.
Simply put, lists are more complex vectors (and they are extremely useful). You can mix different types of elements in a single list. In the example below, we have a list that contains strings, numbers, booleans, a vector, a matrix and another list.
myList = list("A nice sentence", "word", 23, TRUE, 1, FALSE, c("A", "vector", "with", "strings"), matrix(c(1,2,3,4,5,6,1,2,3), nrow = 3, ncol = 3), list("Another list", c(44, 23), TRUE)) # Note: 9 values fill the 3x3 matrix; 6 values would trigger a recycling warning
myList
## [[1]]
## [1] "A nice sentence"
##
## [[2]]
## [1] "word"
##
## [[3]]
## [1] 23
##
## [[4]]
## [1] TRUE
##
## [[5]]
## [1] 1
##
## [[6]]
## [1] FALSE
##
## [[7]]
## [1] "A" "vector" "with" "strings"
##
## [[8]]
## [,1] [,2] [,3]
## [1,] 1 4 1
## [2,] 2 5 2
## [3,] 3 6 3
##
## [[9]]
## [[9]][[1]]
## [1] "Another list"
##
## [[9]][[2]]
## [1] 44 23
##
## [[9]][[3]]
## [1] TRUE
You can see that myList has 9 entries, each of which may contain different types of elements. Given that several items are included in myList, it may be a good idea to use line breaks to organise your code. Also, let’s add labels to the entries.
myList = list( # Now each row is an item in your list
entry1 = "A nice sentence",
entry2 = "word",
aNumber = 23,
entry4 = TRUE,
1,
FALSE,
c("A", "vector", "with", "strings"),
matrix(c(1,2,3,4,5,6,1,2,3), nrow = 3, ncol = 3), # 9 values fill the 3x3 matrix
anotherList = list("Another list",
aVector = c(firstNumber = 44, secondNumber = 23), TRUE)
)
myList
## $entry1
## [1] "A nice sentence"
##
## $entry2
## [1] "word"
##
## $aNumber
## [1] 23
##
## $entry4
## [1] TRUE
##
## [[5]]
## [1] 1
##
## [[6]]
## [1] FALSE
##
## [[7]]
## [1] "A" "vector" "with" "strings"
##
## [[8]]
## [,1] [,2] [,3]
## [1,] 1 4 1
## [2,] 2 5 2
## [3,] 3 6 3
##
## $anotherList
## $anotherList[[1]]
## [1] "Another list"
##
## $anotherList$aVector
## firstNumber secondNumber
## 44 23
##
## $anotherList[[3]]
## [1] TRUE
You can use the indices in the list to access particular items. For example:
myList[[9]][[2]][2] # 2nd element of the 2nd item in the 9th entry -> 23
## secondNumber
## 23
myList[[2]] # "word"
## [1] "word"
Since we now have labels, accessing items is much easier:
myList$anotherList$aVector[2]
## secondNumber
## 23
A list is like an encyclopedia, where different entries may contain several types of data (from text to figures). We could assign labels (i.e., names) to our entries, which would make accessing data much easier (it would really be like a dictionary). Let’s imagine you want to create a corpus using lists. Each entry is a word, which in turn has several entries (meanings, pronunciation, word class etc.). This is a very simple example, but it shows a nice thing about lists, namely, that you can access different dimensions of data directly.
myDict = list(
nationalisation = list(
word = "nationalisation",
class = "N",
pronunciation = "ˌnæʃ(ə)nələˈzeɪʃən",
meaning1 = "the transfer of a major branch of industry or commerce from private to state ownership or control",
meaning2 = NA,
year = "Unknown",
derivedForm = matrix(c("nation", "-al", "-iz", "-ation"),
ncol = 4,
dimnames = list(c("word:"), c("n", "adj", "v", "n")))
),
anotherEntry = list(
word = "another entry",
class = "another entry",
pronunciation = "anotherentry",
meaning1 = "another entry",
meaning2 = "another entry",
year = "another entry",
derivedForm = "another entry"
)
)
Now, if you wanted to know the pronunciation of the word nationalisation, you could type:
myDict$nationalisation$pronunciation
## [1] "ˌnæʃ(ə)nələˈzeɪʃən"
If you are planning to use R just to read your data and run some statistical tests, most of the time you will be working with data frames. This is equivalent to your Excel spreadsheet: a table where your columns are different variables. Data frames are not the only class you can use, though. Data tables, for example, are much faster and more flexible, and may be very useful if you have millions and millions of rows—which is usually not the case in Linguistics. I will focus on data frames.
You can create a data frame using the command data.frame(). Normally, if you want to create factor levels in a data frame from scratch, you’ll use commands such as gl() to generate levels. Here, I will keep things simple and create a very short data frame.
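As a quick aside (a toy example of my own; we won’t need gl() below), this is what gl() does:
gl(2, 3, labels = c("old", "young")) # 2 levels, each repeated 3 times
## [1] old   old   old   young young young
## Levels: old young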
You will usually import your data into R, so you won’t create a data frame from scratch. However, it’s worth looking into it as an exercise to learn how you could in principle manipulate other objects of interest.
myDataFrame = data.frame(col1 = c(2,4,1,3,8), col2 = c("target", "filler", "filler", "target", "filler"))
myDataFrame
## # A tibble: 5 × 2
## col1 col2
## <dbl> <chr>
## 1 2 target
## 2 4 filler
## 3 1 filler
## 4 3 target
## 5 8 filler
If you want to access a column in your data, you use $. For example, d$AgeSubject is a vector that contains all the information in the AgeSubject column inside d. If you simply type AgeSubject, R won’t be able to find it, since there’s no object in your workspace with that name, i.e., it’s a column inside another object. You could, however, attach() your data frame d, in which case R would assume that AgeSubject means d$AgeSubject. This seems like a good idea, but it may not help you if you’re analysing several data frames at once.
You can also transform an object into a data frame with the command as.data.frame(). Here, you’ll see that R will create column names for you.
myDataFrame2 = as.data.frame(myList[[8]]) # Let's use the matrix in the list above
myDataFrame2
## # A tibble: 3 × 3
## V1 V2 V3
## <dbl> <dbl> <dbl>
## 1 1 4 1
## 2 2 5 2
## 3 3 6 3
Both columns and rows can be manually named in a data frame.
colnames(myDataFrame2) = c("col1", "col2", "col3")
rownames(myDataFrame2) = c("row1", "row2", "row3")
myDataFrame2
## # A tibble: 3 × 3
## col1 col2 col3
## <dbl> <dbl> <dbl>
## 1 1 4 1
## 2 2 5 2
## 3 3 6 3
Data frames traditionally have two dimensions (lists can have many, as you can embed matrices and other lists; data tables also allow for multiple dimensions). As a result, you can access any item (cell) as long as you know its column and row. (Nested data frames can have more dimensions, and are incredibly useful. See Extras at the bottom of this page).
myDataFrame2[1,] # This will print the first row and all columns
myDataFrame2[,1] # All rows in the first column
myDataFrame2[1,1] # First row, first column
myDataFrame2[,] # Everything; equivalent to myDataFrame2
First, let’s create another data frame. This time, let’s have four columns: one for speaker ID, another for condition, another for response and a final one for reaction time. You’ll notice I use line breaks below, and that the contents of the function data.frame() are aligned. This makes things easier when you are reading code, and is good practice.
df = data.frame(
speaker = c(1,1,2,2,1),
condition = c("filler", "target", "target", "filler", "filler"),
response = c(0,0,1,0,1),
RT = c(1.232, 0.987, 1.333, 3.212, 0.765)
)
df
## # A tibble: 5 × 4
## speaker condition response RT
## <dbl> <chr> <dbl> <dbl>
## 1 1 filler 0 1.23
## 2 1 target 0 0.987
## 3 2 target 1 1.33
## 4 2 filler 0 3.21
## 5 1 filler 1 0.765
In this case, df is quite small, and you can easily inspect the entire data frame. However, most of the time your data frames will be much larger. In those cases, you don’t want to simply print your data. Rather, you want to summarise the columns and, perhaps, take a look at the first rows to see if everything is in order. You can do this with summary() and head().
summary(df)
## speaker condition response RT
## Min. :1.0 Length:5 Min. :0.0 Min. :0.765
## 1st Qu.:1.0 Class :character 1st Qu.:0.0 1st Qu.:0.987
## Median :1.0 Mode :character Median :0.0 Median :1.232
## Mean :1.4 Mean :0.4 Mean :1.506
## 3rd Qu.:2.0 3rd Qu.:1.0 3rd Qu.:1.333
## Max. :2.0 Max. :1.0 Max. :3.212
head(df)
## # A tibble: 5 × 4
## speaker condition response RT
## <dbl> <chr> <dbl> <dbl>
## 1 1 filler 0 1.23
## 2 1 target 0 0.987
## 3 2 target 1 1.33
## 4 2 filler 0 3.21
## 5 1 filler 1 0.765
You probably noticed that R is treating both speaker and response as numbers. As a result, summary(df) is telling you that the mean of speaker is 1.4. Naturally, you want to treat speaker and response as factors, but R doesn’t know that. To change these variables (columns) into factors, you use the function as.factor(). Alternatively, when you want to change the class of multiple columns in the same data frame, you can use the function transform()—this is what I do below.
# df$speaker = as.factor(df$speaker) # One alternative
# df$response = as.factor(df$response)
df = transform(df, # This is more concise/less repetitive
speaker = as.factor(speaker),
response = as.factor(response)
)
str(df)
## 'data.frame': 5 obs. of 4 variables:
## $ speaker : Factor w/ 2 levels "1","2": 1 1 2 2 1
## $ condition: chr "filler" "target" "target" "filler" ...
## $ response : Factor w/ 2 levels "0","1": 1 1 2 1 2
## $ RT : num 1.232 0.987 1.333 3.212 0.765
Now that the variables in df are correctly coded, we could proceed with our hypothetical analysis.
To create a column in your data frame:
# This will manually add three types of proficiency
df$proficiency = c("adv", "int", "int", "adv", "beg")
# Alternatively
df["proficiency"] = c("adv", "int", "int", "adv", "beg")
To export df, you first need to decide which format you want to use. I will assume you use csv, possibly the most common format in these cases. However, you can export your data in virtually any format available. Read more here or here (more advanced).
write.csv(df, file = "df.csv", row.names = FALSE)
This will export df as df.csv, and will omit row names (numbered rows). If you don’t omit row names, you’ll see an extra column added to your file when you open it in Excel, for example. This extra column will contain row numbers. Finally, note that df.csv will be saved in your working directory, which in this case is my desktop (setwd() above).
Let’s assume you start a new R session and you want to import some data. Because we saved df.csv in the previous step, let’s import that particular file. First, however, I will clean my workspace.
rm(list=ls()) # Clean workspace
ls() # Make sure no object is loaded
## character(0)
df = read.csv("df.csv") # Import df.csv
ls() # Checking that df is the only thing in the workspace
## [1] "df"
Now, let’s take a look at the class of each variable in df, like we did above.
str(df)
## 'data.frame': 5 obs. of 5 variables:
## $ speaker : int 1 1 2 2 1
## $ condition : chr "filler" "target" "target" "filler" ...
## $ response : int 0 0 1 0 1
## $ RT : num 1.232 0.987 1.333 3.212 0.765
## $ proficiency: chr "adv" "int" "int" "adv" ...
You will notice that both speaker and response are treated as integer by R, even though we did transform these variables before exporting df. The problem here is that csv files store data in plain text. As a result, these files have no ‘memory’, i.e., no metadata. Consequently, every time you load a csv file, R will assume the default classes for all your variables, regardless of what you had previously transformed. This is not a problem if you run a script every time you load your data, of course. However, a better solution is to export your data using the RData format (read more about it here).
RData vs. csv
Let’s transform the variables in df again, then export the data frame as myData.RData.
df = transform(df,
speaker = as.factor(speaker),
response = as.factor(response)
)
save(df, file = "myData.RData")
Notice that myData.RData contains one object, namely, df. However, you can save as many objects as you want. For example, save(obj1, obj2, obj3, ..., objn, file = "allData.RData") will save one single file that contains several objects. This is extremely useful because you can save your models’ outputs as well as your data objects, for instance, which in turn means you don’t have to re-run models when you start a new session.
Now, let’s clean our workspace, load myData.RData, and check the class of our variables.
rm(list=ls())
load("myData.RData") # This will load df, which is the only object in myData.RData
str(df)
## 'data.frame': 5 obs. of 5 variables:
## $ speaker : Factor w/ 2 levels "1","2": 1 1 2 2 1
## $ condition : chr "filler" "target" "target" "filler" ...
## $ response : Factor w/ 2 levels "0","1": 1 1 2 1 2
## $ RT : num 1.232 0.987 1.333 3.212 0.765
## $ proficiency: chr "adv" "int" "int" "adv" ...
Now, you can see that the variables in df are correct. RData is a great way to save all your objects in an intelligent way. Another great advantage is that RData files are compressed, which means they will be much smaller than csv files: read this.
There are literally thousands of R packages. If you want to perform a certain task, there is probably at least one package for that. If you can’t find a package, you can create your own—or you can create your own scripts with convoluted functions that perform a series of operations.
To see which packages you have installed, simply type installed.packages(). To install a new package, type install.packages("name_of_package"). You may have to specify the repository from which you wish to install a package: install.packages("name_of_package", repos = "http://cran.r-project.org"). Then, to load the package, you type library(name_of_package).
We will use some very important packages during this workshop, so you may want to install them already. We will explore each of these packages during the workshop. Other packages may be suggested along the way.
install.packages(c("dplyr", "ggplot2", "reshape2", "tidyr", "stringr", "scales", "languageR"),
repos = "http://cran.r-project.org")
Your analysis normally has several steps: for example, importing and cleaning your data, creating summaries, making plots, and running statistical models. You may have a single script (myAnalysis.R) that contains all of these steps. This, however, may not be a good idea. First, these items are doing very different things; second, your script will be very long, which makes it harder to understand which part is doing what. Even if you properly comment everything you do (always a good idea), your file may not be as readable as it could be.
Instead, one thing you can do is have separate R files/scripts for your project (which I’ll call X):
cleanData_X.R: reads your data and cleans it
summary_X.R: summarised data frames and other objects
plots_X.R: code for your plots
models_X.R: statistical models etc.
As a result, there will be a single script for your plots etc. This will make your code more readable when you come back to your analysis in the future.
OK, but these files have to ‘talk to each other’. Here’s the idea:
cleanData_X.R will read your data and prepare it. It will also load the packages you’ll need, set the working directory etc. In other words, it will contain the basic aspects of your analysis. summary_X.R will then load cleanData_X.R. To do that, you run source("cleanData_X.R"). R will then load all the variables that you created in cleanData_X.R but it will not open that file. This is exactly what you want.
plots_X.R will load both cleanData_X.R and summary_X.R. As a result, your script will automatically load all the objects that you may wish to plot (again: without opening the actual files where you prepare the data). The same goes for models_X.R.
The rationale behind this is simple: if you want to replot something, you don’t necessarily want to revisit the entire analysis, from importing the raw data to the actual plots. Instead, you just want to plot things. To do that, you just need to open one file (which only has your plots) and source() the required scripts, as shown below.
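As a concrete sketch, the top of plots_X.R might look like this (file names as above):
source("cleanData_X.R") # loads packages, reads and cleans the data
source("summary_X.R")   # creates the summarised objects
# ... plotting code goes here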
You can also save intermediate data frames as your analysis progresses. For example: let’s say you clean your data frame and add a column. If you don’t want to re-do this every time you open the script, you could save a .csv file with that column and load that file instead. Of course, this may seem useless considering that you simply have to run a couple of lines of code, but if you create your column with a for-loop and your data frame has 200,000 rows, you will certainly not want to wait for the looping to end every single time you open your script.
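A minimal sketch of that workflow (myBigData is a hypothetical object name):
write.csv(myBigData, "myBigData_processed.csv", row.names = FALSE) # run the slow step once
myBigData = read.csv("myBigData_processed.csv") # future sessions start from the file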
The suggestions above may not be very meaningful at this point, but you might want to develop certain organisational skills from the very beginning. If you are new to R, you probably have no experience writing scripts to analyse your own data. The first scripts you produce are usually messy, and eventually you realise that writing clear and organised code is extremely important (for others to understand your analysis, but also for yourself).
Topics: contingency tables, dplyr, data summaries, subsets
First and foremost, before you explore your data, you normally have to tidy it. It’s almost always the case that raw data files are very messy, and you can’t really do much until you’ve organised it in some meaningful way. For example, if you use online surveys, you know the output files are not exactly what you need for plotting, modelling etc.
Here, I will assume you have already cleaned your data and are ready to explore patterns—you can do that with two extremely useful packages: reshape2 and tidyr. Read about them here—also, this vignette has a nice intro to tidyr. You will certainly want to use these packages for organising messy outputs. Also, check the reading list at the end of this tutorial. In particular, take a look at this book.
Let’s load some data. We’ll use the english dataset, which comes with the languageR package. Below, I load the package and the data set, and then assign english to a new variable (d), which will be used as a shortcut.
library(languageR)
data(english)
d = english
The english dataset has several columns (36). You can check the names of all columns by using the names() command.
names(d)
## [1] "RTlexdec" "RTnaming"
## [3] "Familiarity" "Word"
## [5] "AgeSubject" "WordCategory"
## [7] "WrittenFrequency" "WrittenSpokenFrequencyRatio"
## [9] "FamilySize" "DerivationalEntropy"
## [11] "InflectionalEntropy" "NumberSimplexSynsets"
## [13] "NumberComplexSynsets" "LengthInLetters"
## [15] "Ncount" "MeanBigramFrequency"
## [17] "FrequencyInitialDiphone" "ConspelV"
## [19] "ConspelN" "ConphonV"
## [21] "ConphonN" "ConfriendsV"
## [23] "ConfriendsN" "ConffV"
## [25] "ConffN" "ConfbV"
## [27] "ConfbN" "NounFrequency"
## [29] "VerbFrequency" "CV"
## [31] "Obstruent" "Frication"
## [33] "Voice" "FrequencyInitialDiphoneWord"
## [35] "FrequencyInitialDiphoneSyllable" "CorrectLexdec"
Here’s a description of the dataset (provided in the package documentation):
This data set gives mean visual lexical decision latencies and word naming latencies to 2284 monomorphemic English nouns and verbs, averaged for old and young subjects, with various predictor variables.
Note that reaction times have been log-transformed.
To simplify things a bit, let’s reduce the number of columns. First, let’s define which columns are of interest. Then, we subset d and select only those columns.
columns = c("RTlexdec", "Familiarity", "Word", "AgeSubject", "LengthInLetters", "CV", "Obstruent", "Voice", "WordCategory")
d = d[, columns]
head(d)
## # A tibble: 6 × 9
## RTlexdec Familiarity Word AgeSubject LengthInL…¹ CV Obstr…² Voice WordC…³
## <dbl> <dbl> <fct> <fct> <int> <fct> <fct> <fct> <fct>
## 1 6.54 2.37 doe young 3 C obst voic… N
## 2 6.40 4.43 whore young 5 C obst voic… N
## 3 6.30 5.6 stress young 6 C obst voic… N
## 4 6.42 3.87 pork young 4 C obst voic… N
## 5 6.45 3.93 plug young 4 C obst voic… N
## 6 6.53 3.27 prop young 4 C obst voic… N
## # … with abbreviated variable names ¹LengthInLetters, ²Obstruent, ³WordCategory
This is an example of subsetting your data frame, since we’re reducing the number of columns. You can check the size of your data frame in different ways:
ncol(d) # Number of columns
## [1] 9
nrow(d) # Number of rows
## [1] 4568
dim(d) # Dimension (rows x columns)
## [1] 4568 9
As mentioned before, you can also use summary() to have an overview of your data structure, and str() to check the class of each variable.
But often you want to look at a subset of rows. For example, you might want to examine only young speakers’ data. There are different ways of subsetting your data. The one employed above (d = d[, columns]) uses slice notation. Alternatively, you can use the subset() command.
young = subset(d, AgeSubject == "young")
young = d[d$AgeSubject == "young", ]
nrow(young) # Note that this is a subset of d
## [1] 2284
The two methods above are equivalent. You may find one more intuitive than the other, but they’re doing the same thing. When we type d[d$AgeSubject == "young", ], we’re telling R: take d and index it as d[X, ], i.e., return all the rows for which X is TRUE (and all columns). Here, X is the condition d$AgeSubject == "young", so we keep only the rows of d whose AgeSubject column is "young".
Now we have created a new data frame that only contains data from young speakers. This is not a file yet. Likewise, your original file is still intact: all the changes so far are only real within this particular R session, given this script. You could export young as a csv file, though, and then you wouldn’t need to open and run this script again. If you’re coming from SPSS and Excel, it’s important to know the difference between an object (e.g., a data frame) and an actual file that contains an object (e.g., csv files).
You can have more complicated subsets. For example, you might want data from young speakers that only contain words that begin with a vowel. I will use two lines of code to make it more readable.
youngV = d[d$AgeSubject == "young" &
d$CV == "V", ]
nrow(youngV)
## [1] 61
You could keep adding constraints to your subset, either by using & or by using |, which means or (read more about logical operators here). Finally, you may be looking for data that belong to all but a particular group. Instead of asking R for young speakers’ data, you could tell it to give you all data points except for those coming from old speakers. In other words, young is the same as not old in this particular case. The negative of == is !=.
youngV = d[d$AgeSubject == "young" &
           d$CV == "V", ]   # using ==
youngV = d[d$AgeSubject != "old" &
           d$CV != "C", ]   # the same subset, using != (not old = young; not C = V)
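For completeness, here is a sketch using | (the object name youngOrV is mine):
youngOrV = d[d$AgeSubject == "young" | d$CV == "V", ] # young speakers OR vowel-initial words
nrow(youngOrV) # larger than either subset alone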
Because we’re dealing with factors that only contain two levels, this makes little sense. But it may be very useful if you have more levels. For example, you may want to exclude some of the words from the data. Let’s say there’s a group of words you do not want. I’m going to call this group badWords. To do this you will need another important operator: %in%. Let’s see how this works.
badWords = c("trade", "war", "vent") # First, we create a vector with the words
d2 = d[!d$Word %in% badWords, ]
# Alternatively, you could do these two things at once:
d2 = d[!d$Word %in% c("trade", "war", "vent"), ]
# Or perhaps you just don't want "war" in your data:
noWar = d[d$Word != "war", ]
# Which is equivalent to:
noWar = d[!d$Word %in% c("war"), ]
You can see that %in% is quite useful, and its meaning is intuitive. Now, d2 has everything except for those three words. It’s normally a good idea to do things in steps: first, create a vector to hold the words, and then refer back to the vector. This is particularly useful if you want to remove dozens of words, in which case your d[] line would look quite messy.
Finally, you may want to remove NAs from your data. If you want to remove NAs from a particular column, you could type newData = oldData[!is.na(oldData$column), ]. If you wanted to remove any row that has at least one NA in it, you could type newData = oldData[complete.cases(oldData), ], which returns only the rows that have no NAs in them.
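Here’s a toy illustration (oldData is made up for this example):
oldData = data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
oldData[!is.na(oldData$a), ]       # drops row 2 (NA in column a)
oldData[complete.cases(oldData), ] # keeps only row 1 (no NAs anywhere)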
xtabs() and prop.table()
These are two extremely useful commands in R. Both are used to generate contingency tables. Let’s start with some questions you might have about d.
# 1. How many data points are there by subject age?
xtabs(~AgeSubject, data = d) # Which is equivalent to table(d$AgeSubject)
## AgeSubject
## old young
## 2284 2284
# 2. How many data points are there by subject age and by word-initial segment?
xtabs(~AgeSubject + CV, data = d)
## CV
## AgeSubject C V
## old 2223 61
## young 2223 61
# 3. Adding one more dimension: WordCategory
xtabs(~AgeSubject + CV + WordCategory, data = d)
## , , WordCategory = N
##
## CV
## AgeSubject C V
## old 1405 47
## young 1405 47
##
## , , WordCategory = V
##
## CV
## AgeSubject C V
## old 818 14
## young 818 14
You could add more dimensions, but even three categories may be a little confusing to read depending on the number of levels your factors have. Now, we can repeat all three tables above with proportions by using prop.table().
# 1. What proportion of the data is there by subject age?
prop.table(xtabs(~AgeSubject, data = d))
## AgeSubject
## old young
## 0.5 0.5
# 2. What proportion by subject age and by word-initial segment?
prop.table(xtabs(~AgeSubject + CV, data = d))
## CV
## AgeSubject C V
## old 0.48664623 0.01335377
## young 0.48664623 0.01335377
# 3. Adding one more dimension: WordCategory
prop.table(xtabs(~AgeSubject + CV + WordCategory, data = d))
## , , WordCategory = N
##
## CV
## AgeSubject C V
## old 0.307574431 0.010288967
## young 0.307574431 0.010288967
##
## , , WordCategory = V
##
## CV
## AgeSubject C V
## old 0.179071804 0.003064799
## young 0.179071804 0.003064799
Note that the percentage is calculated across all cells in the table. This is not a problem if you’re examining only one dimension, but you may want to change that. To tell prop.table() which dimension to use as reference for percentages, a second argument is used.
prop.table(xtabs(~AgeSubject + CV, data = d)) # All cells add up to 100%
## CV
## AgeSubject C V
## old 0.48664623 0.01335377
## young 0.48664623 0.01335377
prop.table(xtabs(~AgeSubject + CV, data = d), 1) # Rows add up to 100%
## CV
## AgeSubject C V
## old 0.97329247 0.02670753
## young 0.97329247 0.02670753
prop.table(xtabs(~AgeSubject + CV, data = d), 2) # Columns add up to 100%
## CV
## AgeSubject C V
## old 0.5 0.5
## young 0.5 0.5
It’s easy to remember which number to use: count the variables from the left. The first variable is AgeSubject, which means adding 1 to prop.table() will make that particular variable add up to 100%.
prop.table(xtabs(~AgeSubject + CV + WordCategory, data = d), 3) # 100% for WordCategory
## , , WordCategory = N
##
## CV
## AgeSubject C V
## old 0.483815427 0.016184573
## young 0.483815427 0.016184573
##
## , , WordCategory = V
##
## CV
## AgeSubject C V
## old 0.491586538 0.008413462
## young 0.491586538 0.008413462
Finally, you may want to round your proportions. You can do that using round(). The first argument is what you want to round; the second is the number of decimal digits you prefer.
round(prop.table(xtabs(~AgeSubject + CV + WordCategory, data = d), 3), 2)
## , , WordCategory = N
##
## CV
## AgeSubject C V
## old 0.48 0.02
## young 0.48 0.02
##
## , , WordCategory = V
##
## CV
## AgeSubject C V
## old 0.49 0.01
## young 0.49 0.01
This is a good place to break lines of code. Because you have three functions at once, it may appear confusing at first. Using different lines is helpful:
round(
prop.table(
xtabs(~AgeSubject + CV + WordCategory, data = d),
3),
2)
## , , WordCategory = N
##
## CV
## AgeSubject C V
## old 0.48 0.02
## young 0.48 0.02
##
## , , WordCategory = V
##
## CV
## AgeSubject C V
## old 0.49 0.01
## young 0.49 0.01
Note that it’s now much easier to see which function the 2 and the 3 refer to: RStudio automatically aligns lines for you (the same goes for text editors such as Sublime).
Naturally, you can transform your prop.table() into a data frame and then use it to make plots etc.
propData = as.data.frame(round(prop.table(xtabs(~AgeSubject + CV + WordCategory, data = d), 3), 2))
# Which is equivalent to (but messier than):
propData = as.data.frame(
round(
prop.table(
xtabs(~AgeSubject + CV + WordCategory, data = d),
3),
2)
)
# Now let's look at propData:
head(propData)
## # A tibble: 6 × 4
## AgeSubject CV WordCategory Freq
## <fct> <fct> <fct> <dbl>
## 1 old C N 0.48
## 2 young C N 0.48
## 3 old V N 0.02
## 4 young V N 0.02
## 5 old C V 0.49
## 6 young C V 0.49
aggregate()
One nice way to summarise your data and examine particular patterns is to use aggregate(). For example, you might want to have the mean reaction time by word category and subject age.
dAgg = aggregate(RTlexdec ~ WordCategory + AgeSubject, data = d, FUN = mean)
dAgg
## # A tibble: 4 × 3
## WordCategory AgeSubject RTlexdec
## <fct> <fct> <dbl>
## 1 N old 6.66
## 2 V old 6.65
## 3 N young 6.44
## 4 V young 6.43
Read more about this function by typing ?aggregate() in R.
apply()
Another important way to summarise your data is to use the apply() family of functions (read more here, here and here). If you want to apply a particular function to a list, you can use lapply(), as illustrated below.
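For instance, a minimal lapply() example (a toy list of my own):
lapply(list(a = 1:3, b = c(10, 20)), mean)
## $a
## [1] 2
##
## $b
## [1] 15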
For example, you could apply range() to the RTlexdec column by subject age. We’ll do that using tapply(), which applies a given function to each cell in an array. range(), as you may guess, returns the min() and max() values of an array.
tapply(d$RTlexdec, d$AgeSubject, range)
## $old
## [1] 6.403193 7.187808
##
## $young
## [1] 6.205325 6.879150
dplyr
Possibly one of the most useful packages you’ll use to explore your data, dplyr allows you to easily perform different operations all at once. For example, you can use dplyr instead of xtabs (and prop.table) or aggregate.
First, let’s compare dplyr and xtabs.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Using xtabs:
as.data.frame(xtabs(~AgeSubject + CV + WordCategory, data = d))
## # A tibble: 8 × 4
## AgeSubject CV WordCategory Freq
## <fct> <fct> <fct> <int>
## 1 old C N 1405
## 2 young C N 1405
## 3 old V N 47
## 4 young V N 47
## 5 old C V 818
## 6 young C V 818
## 7 old V V 14
## 8 young V V 14
# Using dplyr:
count(group_by(d, AgeSubject, CV, WordCategory))
## # A tibble: 8 × 4
## AgeSubject CV WordCategory n
## <fct> <fct> <fct> <int>
## 1 old C N 1405
## 2 old C V 818
## 3 old V N 47
## 4 old V V 14
## 5 young C N 1405
## 6 young C V 818
## 7 young V N 47
## 8 young V V 14
# Alternatively:
# d %>%
# group_by(AgeSubject, CV, WordCategory) %>%
# count()
Note that the output of dplyr above is already a data frame. As we will see, dplyr can also work in different steps (see %>% below). For example, here is how we would add the proportions to the output above (i.e., the equivalent of prop.table(xtabs(...))):
# Using prop.tables and xtabs
as.data.frame(prop.table(xtabs(~AgeSubject + CV + WordCategory, data = d),3))
## # A tibble: 8 × 4
## AgeSubject CV WordCategory Freq
## <fct> <fct> <fct> <dbl>
## 1 old C N 0.484
## 2 young C N 0.484
## 3 old V N 0.0162
## 4 young V N 0.0162
## 5 old C V 0.492
## 6 young C V 0.492
## 7 old V V 0.00841
## 8 young V V 0.00841
# Using dplyr
count(group_by(d, AgeSubject, CV, WordCategory)) %>%
group_by(WordCategory) %>%
mutate(Freq = n/sum(n))
## # A tibble: 8 × 5
## AgeSubject CV WordCategory n Freq
## <fct> <fct> <fct> <int> <dbl>
## 1 old C N 1405 0.484
## 2 old C V 818 0.492
## 3 old V N 47 0.0162
## 4 old V V 14 0.00841
## 5 young C N 1405 0.484
## 6 young C V 818 0.492
## 7 young V N 47 0.0162
## 8 young V V 14 0.00841
The nice thing about pipelines (%>%) is that you could even add a plot to the end of the sequence. This makes sense, given that you normally transform your data frame so you can plot it (or model it).
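Here is a sketch of that idea, assuming ggplot2 (covered in Part 3) is loaded:
library(ggplot2)
d %>%
  count(AgeSubject, CV) %>%
  ggplot(aes(x = CV, y = n, fill = AgeSubject)) +
  geom_col() # counts by CV and subject age, piped straight into a bar plot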
Here are some of the different functions the package has:
arrange(): orders rows by one or more columns
filter(): subsets rows that meet some condition
group_by(): groups data by certain variables
mutate(): creates a new column
select(): selects columns in your data
summarise(): collapses grouped data into summary values
The name of each function is very intuitive. You could in fact use filter() to subset your data (there are numerous ways of doing the same thing in R). Let’s take a look at some examples.
# First, let's remove (subtract) some columns
dClean = select(d, -c(Voice, Obstruent, Word))
head(dClean)
## # A tibble: 6 × 6
## RTlexdec Familiarity AgeSubject LengthInLetters CV WordCategory
## <dbl> <dbl> <fct> <int> <fct> <fct>
## 1 6.54 2.37 young 3 C N
## 2 6.40 4.43 young 5 C N
## 3 6.30 5.6 young 6 C N
## 4 6.42 3.87 young 4 C N
## 5 6.45 3.93 young 4 C N
## 6 6.53 3.27 young 4 C N
# Arranging a data frame by a column
orderedRT = arrange(dClean, RTlexdec)
head(orderedRT, 10)
## # A tibble: 10 × 6
## RTlexdec Familiarity AgeSubject LengthInLetters CV WordCategory
## <dbl> <dbl> <fct> <int> <fct> <fct>
## 1 6.21 4.33 young 5 C N
## 2 6.21 4.33 young 5 C V
## 3 6.21 3.87 young 5 C V
## 4 6.21 1.3 young 5 C N
## 5 6.22 6.3 young 3 C N
## 6 6.22 5.5 young 3 C N
## 7 6.23 4.9 young 4 C N
## 8 6.23 5.5 young 5 C N
## 9 6.24 4.1 young 4 C V
## 10 6.24 5.17 young 5 C V
# Arranging by two columns
orderedRT2 = arrange(dClean, LengthInLetters, RTlexdec)
head(orderedRT2, 10)
## # A tibble: 10 × 6
## RTlexdec Familiarity AgeSubject LengthInLetters CV WordCategory
## <dbl> <dbl> <fct> <int> <fct> <fct>
## 1 6.26 6.37 young 2 C V
## 2 6.32 6.47 young 2 C N
## 3 6.32 6.47 young 2 C V
## 4 6.61 6.47 old 2 C N
## 5 6.61 6.47 old 2 C V
## 6 6.61 6.37 old 2 C V
## 7 6.22 6.3 young 3 C N
## 8 6.22 5.5 young 3 C N
## 9 6.24 5.53 young 3 V N
## 10 6.24 5.47 young 3 C V
young = filter(dClean, AgeSubject == "young")
head(young)
## # A tibble: 6 × 6
## RTlexdec Familiarity AgeSubject LengthInLetters CV WordCategory
## <dbl> <dbl> <fct> <int> <fct> <fct>
## 1 6.54 2.37 young 3 C N
## 2 6.40 4.43 young 5 C N
## 3 6.30 5.6 young 6 C N
## 4 6.42 3.87 young 4 C N
## 5 6.45 3.93 young 4 C N
## 6 6.53 3.27 young 4 C N
Now, let’s work with some more interesting cases. Let’s say you want to know the mean reaction time by word length and by word category. You also want to know how many items there are in each category. Finally, you want to order the resulting data frame by mean reaction time within the categories of interest.
dSum = dClean %>%
group_by(WordCategory, LengthInLetters) %>%
summarise(meanRT = mean(RTlexdec), n = n()) %>%
arrange(WordCategory, meanRT)
## `summarise()` has grouped output by 'WordCategory'. You can override using the
## `.groups` argument.
dSum
## # A tibble: 12 × 4
## WordCategory LengthInLetters meanRT n
## <fct> <int> <dbl> <int>
## 1 N 2 6.46 2
## 2 N 3 6.54 474
## 3 N 5 6.55 896
## 4 N 4 6.56 1336
## 5 N 6 6.58 186
## 6 N 7 6.60 10
## 7 V 2 6.45 4
## 8 V 4 6.54 684
## 9 V 3 6.54 202
## 10 V 5 6.54 608
## 11 V 6 6.57 152
## 12 V 7 6.59 14
Here’s what the functions above do: group_by() groups the data by WordCategory and LengthInLetters; summarise() creates two new columns, meanRT (the mean RTlexdec per group) and n (the number of items per group); arrange() orders the rows; and %>% chains all these steps together.
The result is exactly what we want: mean reaction times by word category and word length, arranged according to these two variables. You could now use dSum to make plots and further explore patterns of interest.
It’s important to note that the variables will remain grouped inside the pipeline until you ungroup them by using %>% ungroup(). This allows for even more complex sets of operations, since you could have multiple groupings that rely (or not) on previous groupings.
One known tricky aspect of dplyr is that empty categories are dropped. For example, if a given grouping combination has no values in it, it will not be shown in the final data frame. This may have important consequences if you are calculating proportions, for example. To avoid any possible problems, you can explicitly tell dplyr to output complete data frames. You do that by loading another package, tidyr, and adding an extra line to your pipeline. We don’t need that here, but I’ll add it just as an example.
library(tidyr)
dSum = dClean %>%
group_by(WordCategory, LengthInLetters) %>%
summarise(meanRT = mean(RTlexdec), n = n()) %>%
arrange(WordCategory, meanRT) %>%
ungroup() %>%
complete(WordCategory, LengthInLetters) # will add NAs if needed
## `summarise()` has grouped output by 'WordCategory'. You can override using the
## `.groups` argument.
dSum
When you run this, you’ll notice that the output is a ‘tibble’. Tibbles are quite useful to visualize your data: unlike traditional data frames, if you call a tibble, not all rows are printed (up to 10,000 rows are printed when calling a data frame). As well, the columns are collapsed to fit the resolution of your window. To create a tibble, you can use as_data_frame() (using an existing object) or data_frame() (from scratch). To read more about tibbles, check this.
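A quick sketch (note that in current releases these functions live in the tibble package under the newer names as_tibble() and tibble()):
library(tibble)
as_tibble(dClean) # converts the data frame created above into a tibble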
A final example: counting NAs using dplyr. First, let’s create a data frame with some NAs.
df_NAs = data.frame(a = c(1,2,5,NA,3,9,NA),
b = c("word1", "word2", NA, "word3", "word4", "word5", "word6"),
c = c(T,T,F,F,T,F,T),
d = c("A", "B", NA, NA, NA, "E", NA))
Now, let’s count the number of NAs by column:
df_NAs %>%
  summarise_all(~ sum(is.na(.))) # lambda syntax; the older funs() is deprecated
## # A tibble: 1 × 4
## a b c d
## <int> <int> <int> <int>
## 1 2 1 0 4
Make sure to read more about other aspects of dplyr here or here.
tidyr
Ideally, you want your columns to represent variables and your rows to represent observations. Let’s see an example where that is not the case; i.e., the data is in ‘wide’ format.
someData = data.frame(
  a = round(rnorm(10, mean = 8, sd = 0.3), 1),
  b = round(rnorm(10, mean = 6), 1),
  subject = rep(c("speaker1", "speaker2"), each = 5)
)
head(someData)
## # A tibble: 6 × 3
## a b subject
## <dbl> <dbl> <chr>
## 1 8.6 6.4 speaker1
## 2 8.3 6.6 speaker1
## 3 7.8 7.1 speaker1
## 4 8.4 5.7 speaker1
## 5 8.1 5.9 speaker1
## 6 7.5 5.5 speaker2
In the data above, suppose a and b are different tests, which were taken by subjects speaker1 and speaker2. We want a single column for test, and another column for the scores; i.e., we want to go from ‘wide’ format to ‘long’ format. To do that, we can use the gather() function in the tidyr package (already mentioned a couple of times).
library(tidyr)
longData = gather(someData,
key = test,
value = score,
a:b)
longData
## # A tibble: 20 × 3
## subject test score
## <chr> <chr> <dbl>
## 1 speaker1 a 8.6
## 2 speaker1 a 8.3
## 3 speaker1 a 7.8
## 4 speaker1 a 8.4
## 5 speaker1 a 8.1
## 6 speaker2 a 7.5
## 7 speaker2 a 8.1
## 8 speaker2 a 7.5
## 9 speaker2 a 8.1
## 10 speaker2 a 7.7
## 11 speaker1 b 6.4
## 12 speaker1 b 6.6
## 13 speaker1 b 7.1
## 14 speaker1 b 5.7
## 15 speaker1 b 5.9
## 16 speaker2 b 5.5
## 17 speaker2 b 7.7
## 18 speaker2 b 6.1
## 19 speaker2 b 5.3
## 20 speaker2 b 6.4
The opposite of gather() is spread(), but you can read more about that in the documentation. A small sketch follows.
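Here’s a minimal sketch going back from long to wide. spread() needs each row to be uniquely identified, which our toy data lacks, so I first add a made-up observation number (obs):
longData$obs = rep(1:5, times = 4) # hypothetical row id: 5 scores per subject per test
wideAgain = spread(longData, key = test, value = score)
head(wideAgain)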
Topics: exploratory data analysis, ggplot2
Before we start, this is a good post on the typical chart-selection diagrams. Also, this is a very useful post by Andrew Gelman on data visualization.
There are different packages in R for plotting (e.g., rCharts, lattice, ggplot2, plotly). Here, I will focus on static plots, but you should definitely take a look at interactive plots such as plotly and ggvis, as well as Shiny Apps in general. It all depends on which type of plot best serves your analysis. Even though we will be focusing on ggplot2, let’s first take a look at R’s basic plotting functions, which do not require any extra packages (read more here).
Perhaps the simplest type of plot is a histogram. Let’s simulate 1000 numbers that come from a normal distribution and then create a histogram. Below, I use the command set.seed() to ensure that the random numbers generated by rnorm() will always be the same (otherwise, each time you run rnorm() you will get different numbers).
set.seed(5)
myNumbers = rnorm(1000, mean = 5, sd = 1) # Randomly generate 1000 numbers
hist(myNumbers,
main = "This is a histogram", # Plot title
xlab = "Number", ylab = "Density", # Axes labels
freq = FALSE, # To show the prob. density
breaks = 22, # Number of bins*
col = "lightblue") # Colour of bins
#* This number is a suggestion only (see ?hist)
While your plot window is open (i.e., active), you can add ‘layers’ to your plot by running additional lines of code. For example, we can add a density curve to the histogram in question. For colour options in R, check the documentation. Other websites can help, e.g., this one.
hist(myNumbers,
main = "This is a histogram", # Plot title
xlab = "Number", ylab = "Density", # Axes labels
freq = FALSE, # To show the prob. density
breaks = 22, # Number of bins
col = "lightblue") # Colour of bins
curve(dnorm(x, mean = mean(myNumbers), sd = sd(myNumbers)),
add = TRUE, # Add curve to active plot
col = "black",
lwd = 3) # Line width
Other types of plot are generated with functions such as boxplot() and plot()—in fact, if you use plot() with an object, R will try to plot it with no extra information (try using plot() with xtabs(), as sketched below).
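For instance, reusing d and xtabs() from Part 2:
plot(xtabs(~AgeSubject + CV, data = d)) # R picks a mosaic-style plot for a contingency table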
For example, you can create a density plot by typing plot(density(myNumbers)). You can find more information about these types of plots here. Let’s create a simple data frame where two columns represent normally distributed values with different means.
set.seed(10)
newData = data.frame(
col1 = rnorm(1000, mean = 5, sd = 2),
col2 = rnorm(1000, mean = 2, sd = 1)
)
head(newData) # Let's take a look at the first rows
## # A tibble: 6 × 2
## col1 col2
## <dbl> <dbl>
## 1 5.04 3.05
## 2 4.63 2.29
## 3 2.26 2.24
## 4 3.80 2.83
## 5 5.59 1.78
## 6 5.78 2.29
We can take a look at histograms for each column. First, we tell R to open a plot window for two plots. Then, we follow the same procedure shown above.
par(mfrow = c(1,2)) # Organise plots: 1 row, 2 cols
hist(
newData$col1,
freq = FALSE,
col = "white",
xlab = "Column 1",
main = "Plot A",
breaks = 20
)
curve(dnorm(x, mean = mean(newData$col1), sd = sd(newData$col1)),
add = TRUE,
lwd = 3
)
hist(
newData$col2,
freq = FALSE,
col = "gray",
xlab = "Column 2",
main = "Plot B",
breaks = 20
)
curve(dnorm(x, mean = mean(newData$col2), sd = sd(newData$col2)),
add = TRUE,
lwd = 3
)
We can do the same using boxplots.
boxplot(newData$col1, newData$col2, # A very simple boxplot
col = "darkseagreen")
We will return to histograms and boxplots below, where I explore in more detail how to adjust your plots.
Read more: This is an interesting place to visit if you’re already familiar with basic plotting in R. If you prefer basic plotting to ggplot2, this website has nice tips on boxplots in particular (explore the website for other plots)—you can also try this.
The ggplot package
ggplot2 is possibly the most popular package for data visualization in R. Importantly, there is a huge number of tutorials online, and the package’s documentation is excellent. Places like StackOverflow are also great—you will certainly be visiting these websites if you decide to use R.
There are two types of plots you can make using ggplot2 (i.e., two main commands). First, you can use qplot() for simpler plots. Second, you can use ggplot(). We will focus on the latter, but let’s briefly explore qplot().
Let’s load some data. We’ll use the english dataset again (and we’ll select the same columns as last time).
library(languageR)
data(english)
d = english
This is again the description of the dataset (provided in the package documentation):
This data set gives mean visual lexical decision latencies and word naming latencies to 2284 monomorphemic English nouns and verbs, averaged for old and young subjects, with various predictor variables.
columns = c("RTlexdec", "Familiarity", "Word", "AgeSubject", "LengthInLetters", "CV", "Obstruent", "Voice")
d = d[, columns]
Finally, before we actually start plotting, we need to load ggplot2.
library(ggplot2)
qplot()
We will build the example incrementally. The basic syntax is qplot(variable1, variable2, data): we will start with the simplest qplot() possible and then increase its complexity by adding more dimensions.
Let’s plot RTlexdec as a function of Familiarity (and other variables).
qplot(Familiarity, RTlexdec, data = d)
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
We can see that more familiar words yield faster reaction times overall. Now, let’s add colour (conditional on the age of the subjects, which here is a factor with only two levels: old and young)—remember you can type str(d) and summary(d) to check what the data look like.
qplot(Familiarity, RTlexdec, data = d, colour = AgeSubject)
As expected, reaction time is faster among young subjects. What about the number of letters in each word? Let’s add this extra dimension by adjusting the size of each data point according to the length of each word.
qplot(Familiarity, RTlexdec, data = d, colour = AgeSubject, size= LengthInLetters)
This doesn’t look very informative, so let’s drop the size argument for now. In fact, we know nothing about the density of the data, i.e., how much clustering there is. We can add transparency to the data points to solve that.
qplot(Familiarity, RTlexdec, data = d, colour = AgeSubject, alpha = I(0.2))
We could use facets to investigate yet another variable. Let’s look at Obstruent, a factor with two levels (obstr and cont). The question here is whether the sonority of the first phoneme in the word affects reaction time.
qplot(Familiarity, RTlexdec, data = d, colour = AgeSubject, alpha = I(0.2), facets = ~ Obstruent)
If you wanted to compare reaction times across different factor levels using boxplots, you’d add an extra argument to qplot, namely, geom. In this case, geom = "boxplot", but geom can also take vectors, which means you can have multiple plot layers. In this case, let’s examine the influence of Obstruent on reaction time again.
qplot(Obstruent, RTlexdec, data = d, geom = "boxplot")
qplot() can be useful for quick plots, so you may want to use it at times. I will now turn to ggplot(), which is the focus of our discussion on data visualization. To read more about qplot(), visit this website.
ggplot()
ggplot() works with layers (the Ls below).
Let’s see how to reproduce the scatterplot above using ggplot(). We want to create a scatterplot where the x-axis is Familiarity and the y-axis is RTlexdec. We also want to facet by Obstruent and add colour conditional on AgeSubject.
ggplot(data = d, aes(x = Familiarity, y = RTlexdec, colour = AgeSubject)) + # L 1
geom_point(alpha = 0.2) + # L 2
facet_grid(~Obstruent) # L 3
The first layer, L 1, tells R which data should be plotted, what the axes should be, and the conditional colour. Note that information about the axes goes inside aes(), which stands for aesthetics. The second layer tells R the kind of plot we want to make. geom_point() is the typical scatterplot. The alpha parameter is our transparency. Finally, L 3 tells R that we want to facet the data by Obstruent.
Just like geom_boxplot(), many other geom_s are available:
boxplot: geom_boxplot()
histogram: geom_histogram()
bar plot: geom_bar()
scatter plot: geom_point()
spread points: geom_jitter()
lines: geom_line()
violin plots: geom_violin()
density plots: geom_density()
text plots: geom_text()
pie chart: coord_polar()
rug: geom_rug()
… read more here
To get help, always type ?geom_.... You can also annotate your plot by using annotate() (a sketch follows below). Crucially, each ggplot() layer is separated by a plus sign. If you use a different line for each layer, it’s easier to deactivate a particular layer if you need to: you simply use a # in that particular line of code.
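A minimal annotate() sketch (the coordinates and label text are arbitrary choices):
ggplot(data = d, aes(x = Familiarity, y = RTlexdec)) +
  geom_point(alpha = 0.2) +
  annotate("text", x = 6.5, y = 7.1, label = "An arbitrary label")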
In ggplot(), you can control virtually every detail in your plot. The theme() layer is responsible for the overall aesthetics (see the sketch below)—there’s a lot of material about it. Read more here.
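A small theme() sketch (all settings here are arbitrary choices):
ggplot(data = d, aes(x = AgeSubject, y = RTlexdec)) +
  geom_boxplot() +
  theme(axis.title = element_text(size = 14),
        panel.background = element_blank())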
Naturally, because you can add several layers to your plot, you can end up with a plot that makes no sense (or that has too much information and therefore is not very informative). Let’s create such a plot just to understand a bit more about layers.
ggplot(data = d, aes(x = AgeSubject, y = RTlexdec)) +
geom_jitter(color = "brown", alpha = 0.3) +
geom_boxplot() +
geom_violin(alpha = 0.3, fill = "blue")
This is a very ugly plot, of course, and it also provides too much information. However, R will not warn you that this is a weird plot: if something is plottable, R will do it. The nice thing about using separate lines for each layer in your plot is that you could ‘turn off’ some layers very easily (but remember to end each line with + so that R knows your plot isn’t over yet):
ggplot(data = d, aes(x = AgeSubject, y = RTlexdec)) +
#geom_jitter(color = "brown", alpha = 0.3) + This layer is now inactive
#geom_boxplot() +
geom_violin(alpha = 0.3, fill = "blue")
If your data are not normally distributed, using boxplots may be visually misleading (see below). Violin plots, in this case, show the actual distribution of the data points.
Another nice thing about layers is that you can source data from different data frames. When you use geom_point(), R sees no data. It then looks ‘one level up’ and finds ggplot(data=...). As a result, geom_point() inherits whichever data frame you used in the first layer of your plot. However, you could override this by providing a different data frame to geom_point(). In fact, you could have multiple geom_point()s, each of which contains a different data frame. This can be very useful if you are plotting different data frames in a single plot. In sum, you have a lot of flexibility.
ggplot(data = data1, aes(...)) +     # Layer 1
geom_point() +                       # This will use data1
geom_point(data = data2, aes(...)) + # This will use data2
geom_violin() +                      # This will use data1 again
geom_point(data = data3, aes(...))   # You get the idea
Next, let’s make a density plot with a rug at the bottom of the plot. Note that we can also use linebreaks to make a single layer more readable.
ggplot(data = d, aes(x = RTlexdec)) +
geom_density(fill = "blue",
color = "blue", # Line colour
alpha = 0.5) +
geom_rug()
You can also flip the plot (which may or may not make sense). This is yet another layer:
ggplot(data = d, aes(x = RTlexdec)) +
geom_density(fill = "blue",
color = "blue", # Line colour
alpha = 0.5) +
geom_rug() +
coord_flip()
Next, let’s look at geom_boxplot().
ggplot(data = d, aes(x = AgeSubject, y = RTlexdec)) +
geom_boxplot()
Again: because you can have layers, you can add the actual data points to your plot. Let’s do that by using geom_jitter()—but let’s add some transparency. Also, note that the order of the layers matters: if geom_jitter() precedes geom_boxplot(), the points will be behind the boxplot. Here, because we’ll add geom_jitter() after geom_boxplot(), the points will be in front of the boxplots. Finally, note that we’re not using geom_point() because the x-axis is categorical, which means all the points would be stacked vertically.
ggplot(data = d, aes(x = AgeSubject, y = RTlexdec)) +
geom_boxplot() +
geom_jitter(alpha = 0.1) # This won't be very informative
Let's add some colour. There are basically two ways of adding colour to your plot: first, you can simply use a particular colour that plays no particular role. Second, you may want to add colour to represent different levels of a factor, for example, in which case the colour itself is informative. If your colour is conditional on some factor, you should add the colour inside aes(). If you just want things to be colourful, then place the colour outside of aes(). For example:
ggplot(data = d, aes(x = AgeSubject, y = RTlexdec)) +
geom_jitter(alpha = 0.1, color = "darkseagreen")
In the plot above, we’re just using colour for aesthetic reasons. Now look at the plot below.
ggplot(data = d, aes(x = AgeSubject, y = RTlexdec)) +
geom_jitter(alpha = 0.3, aes(colour = Obstruent))
Now, we’re telling R to use colour according to a particular factor—which will automatically generate a legend to the right of the plot. R will choose the colours for you (and it doesn’t do a great job most of the time). As a result, you may want to use your own colours: you can even manually create very specific combinations of colours (read more here). There are also palettes you can use, if you are looking for some pre-set combinations of colours.
ggplot(data = d, aes(x = AgeSubject, y = RTlexdec)) +
geom_jitter(alpha = 0.3, aes(colour = Obstruent)) +
scale_colour_manual(values = c("black", "red"))
Note that we're using scale_colour_manual() because we're setting the values for the colour aesthetic. Imagine we were using bars, and we wanted to fill each bar with a colour that is conditional on some variable. We'd add geom_bar(aes(fill = variable)). In that case, if we wanted to manually set the colour for fill, we would use scale_fill_manual(). There are numerous tutorials on how to use colours in ggplot2 (e.g., this one).
What you could do is define colours that you might like and always use
them. For example, you might want to have plot templates for papers,
more colourful ones for presentations and posters etc. In fact, this is
not only the case for colours: overall, you probably want to spend some
time designing a prototypical plot that satisfies your needs (and
taste), and then use/adapt the template for your analyses.
Naturally, the plots above are not aesthetically pleasing—and don’t make much sense, either. The idea here is to show you what you could do. Let’s use facets now to see whether vowel-initial words yield different reaction times relative to consonant-initial words. I will also further personalise the plot by adding axis labels, a title, and adjusting the width of the boxplot, its fill colour and the plot background.
ggplot(data = d, aes(x = AgeSubject, y = RTlexdec)) +
geom_boxplot(width = 0.5, alpha = 0.3) +
facet_grid(~CV) +
ggtitle("My boxplot\n") + # \n will add a line break
labs(y = "Reaction time\n", x = "\nSubject age") +
theme_bw() # White background
By adding \n to your text, you can easily create more room between the plot and your labels. There are other ways to adjust the spacing, but this is by far the simplest/easiest to remember (it adds exactly one blank line).
Let’s now go back to violin plots. Above, I mentioned that they can be quite useful when your data are not normally distributed. Let’s see an example. First, let’s create a bimodal distribution. We will do that by first creating two normally distributed vectors and then combining them into a single column in a data frame.
x = rnorm(1000, mean = 5, sd = 0.5) # First distribution
y = rnorm(1000, mean = 8, sd = 0.5) # Second distribution
z = data.frame(data=c(x,y), someFactor = gl(1, 2000, 2000, labels = c("Level")))
head(z)
## # A tibble: 6 × 2
## data someFactor
## <dbl> <fct>
## 1 4.47 Level
## 2 4.32 Level
## 3 5.04 Level
## 4 4.55 Level
## 5 6.83 Level
## 6 5.56 Level
The data frame above has two columns, namely, data and someFactor, which has a single level (Level). In other words, we're simulating a group with a bimodal distribution. We can see that it's bimodal by plotting a histogram.
ggplot(data = z, aes(x = data)) +
geom_histogram(bins = 30, colour = "white")
If you now plot these data using a boxplot, they will appear to be normally distributed.
ggplot(data = z, aes(y = data, x = someFactor)) +
geom_boxplot(width = 0.5, alpha = 0.5)
This can obviously be misleading: a boxplot only shows quartile-based summaries, so it hides multimodality. Violin plots do not:
ggplot(data = z, aes(y = data, x = someFactor)) +
geom_violin(width = 0.5, alpha = 0.5)
You may also want to use a bar plot to show percentages, even though this is usually not a great use of space, given that typical bars convey little information. Let's take a look at geom_bar() (which in many cases can give you the same output as geom_histogram()). In this particular case, I'll add % to the y-axis using the scales package.
library(scales)
ggplot(data = d, aes(x = Obstruent, y = after_stat(count / sum(count)))) + # after_stat() replaces the deprecated ..count.. notation
geom_bar(width = 0.5) +
scale_y_continuous(labels = percent_format()) + # To have % in the y-axis
labs(x = "Type of segment", y = "%")
Note that you could convey the exact same thing with two dots, one for each percentage in the plot (even though that would require some extra work on the data).
What if you wanted to compare the mean (with standard error bars) reaction time between voiced and voiceless word-initial segments? Here, we'll use stat_summary(), a very useful function in ggplot(). First, we'll add error bars. Then, we'll also add a square to represent the actual means. As mentioned before, you'll notice that several layers of stat_summary() can be used.
ggplot(data = d, aes(x = Voice, y = RTlexdec)) +
stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.3) + # Error bars
stat_summary(fun = mean, geom = "point", shape = 15, size = 4) # fun replaces the deprecated fun.y argument
By using shape = 15, I'm telling R to plot the means (fun = mean) using squares (check the possible shapes here). Read more about stat_summary() here. If you want to use standard deviations instead of standard errors, simply use mean_sdl instead of mean_se. In addition, you can adjust the multiplier by adding fun.args = list(mult = 1) (the default is 2 standard deviations). Finally, for 95% CIs, you can use mean_cl_normal (assuming normality) or mean_cl_boot.
What if you want to plot the reaction time of different words by using the actual words in the plot…? Let’s plot reaction time by familiarity and add two trend lines (linear and non-linear). However, let’s focus on words with three letters only: notice that you can subset the data in the plot.
ggplot(data = d[d$LengthInLetters == 3, ], aes(x = Familiarity, y = RTlexdec)) +
geom_text(aes(label = Word), alpha = 0.2) +
geom_smooth(colour = "yellow", size = 2, method = "lm") + # Linear trend line
geom_smooth(colour = "blue", size = 2, method = "loess", linetype = "dashed")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
You can omit the standard error shadow if you add se = FALSE to geom_smooth(). Let's annotate this plot.
ggplot(data = d[d$LengthInLetters == 3, ], aes(x = Familiarity, y = RTlexdec)) +
geom_text(aes(label = Word), alpha = 0.2) +
geom_smooth(colour = "yellow", size = 2, method = "lm") + # Linear trend line
geom_smooth(colour = "blue", size = 2, method = "loess", linetype = "dashed") +
annotate("text", x = 4, y = 7, label = "Hello", fontface = "bold", size = 7, color = "darkgreen")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Read more about annotate() here.
People often underestimate how much the appearance of a plot matters, but being able to easily understand what the plot is trying to convey can be extremely important. Plots are more reader-friendly than tables, and it's great if your results can be plotted in a clear and interesting way.
One very basic thing that people often forget about is font size. Depending on how you save your plot in R (see below), and on how large the plot will be in your paper/presentation, font size can change dramatically. If you plotted using a large window and the font size looked OK, it may be too small in your paper.
The most important thing you need to remember is that ggplot() has an entire layer to adjust the looks of your plot. As mentioned above, it's called theme() (read more here). Inside theme() you will specify all the aesthetic details you want. From font face/size and colour to the size of the ticks or the angle of your x-axis labels: everything is done within theme().
To exemplify theme(), I'm going to replot reaction time by voice using error bars. Let's see how different the plot will look.
ggplot(data = d, aes(x = Voice, y = exp(RTlexdec))) + # back-transform y-axis
stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.15, linewidth = 0.35) + # Error bars
stat_summary(fun = mean, geom = "point", size = 3) +
theme_bw() +
theme(text = element_text(size = 12),
plot.title = element_text(hjust = 0.5)) + # Center title
labs(x = "\nVoicing", y = "Reaction time", title = "A plot") + # Labels + title
scale_y_continuous(labels = unit_format(unit = "ms", sep = " ")) # add ms to y-axis
One important thing to remember in this particular case: theme_bw() comes before theme(). In other words, if you want to add a particular theme (from ggthemes; see below), it has to come before theme(). Otherwise, your adjustments within theme() will be ignored.
Several additional parameters can be changed (check the documentation). I personally like to use CMU Sans Serif because I write with CMU Serif in \(\LaTeX\), so text and plots look more harmonious. Before choosing your font, however, you may have to install an extra package, extrafont, which may take a little while to install (read here).
Finally, remember that plots should make things easier, so it may be useful to assume that 'less is more'. You might want to check two packages: ggthemes and cowplot.
In many cases, you will want to order your axis labels based on a given variable (there are many reasons why you might wish to do something like that). Here's one of many options: you can create a new column that reorders your levels based on a different variable. In this example, we'll use reorder().
In this simple example, let's say you want to plot the mean reaction time by Frication, which has four levels. These levels are automatically sorted alphabetically by R. However, you may want to order them based on the mean RT in each case. As a result, your plot will show a nice ascending pattern.
d = english
fricMean = d %>% group_by(Frication) %>% summarise(meanRT = mean(RTlexdec))
fricMean$Frication2 = reorder(fricMean$Frication, fricMean$meanRT)
ggplot(data = fricMean, aes(x=Frication, y=meanRT)) +
geom_point(size = 3) +
theme_bw() +
ggtitle("Frication in alphabetical order") +
labs(x = "Frication", y = "Mean RT")
ggplot(data = fricMean, aes(x=Frication2, y=meanRT)) +
geom_point(size = 3) +
theme_bw() +
ggtitle("Frication ordered by mean RT") +
labs(x = "Frication", y = "Mean RT")
# ALSO:
# You could reorder your levels *in the plot* itself as well, by using the reorder() function within aes(). Example: aes(x=reorder(Frication, meanRT), y = meanRT)
Finally, for plotting MCMC models, you can use the ggmcmc package, which prepares the data (using dplyr) for plotting with ggplot2.
If you are using RStudio, you can click on Export (Save as PDF), tick use Cairo_pdf..., and save it. You can also type pdf("fileName.pdf") before your plot and dev.off() right after your plot: run the first command, then generate your plot, then run the second command. Your plot will be saved as a PDF file in your working directory.
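For example, a minimal sketch (the file name and dimensions here are arbitrary):
pdf("myBoxplot.pdf", width = 6, height = 4) # Open a PDF device (size in inches)
print(ggplot(data = d, aes(x = AgeSubject, y = RTlexdec)) +
geom_boxplot()) # print() ensures the plot is drawn when run from a script
dev.off() # Close the device; the PDF is now in your working directory
For ggplot2 plots specifically, ggsave("myBoxplot.pdf") is a convenient alternative: it saves the last plot you displayed.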
If you are using R, you can also type cmd + s
(on a Mac)
when the plot window is active. This will save your plot as a pdf, but
note that the resolution of your file will be sensitive to the
actual size of the window when you save it. As a result, if you
generate 10 plots and you want them to have the same size, you cannot
alter the size of the window in the process (besides adjusting the
actual code of every plot, of course). This is very practical, as long
as you remember that the window size affects the size of the plot.
In sum: ggplot2 is a very powerful and flexible package for data visualization. You can adjust virtually every aspect of your plots by using layers of code.
Topics: statistical tests, ANOVA, regression, log-odds, odds, probabilities
The idea here is to focus mostly on the syntax of tests and models, and not on the interpretation of the results (as that would require more than the two hours that we have). You will see that running tests and models in R is quite straightforward.
We have already seen some of the most basic stats functions that provide some information about our data. For example, mean(), median(), sd(), min(), max(), range() (if your vector contains NAs, however, you need to add na.rm = TRUE to your functions). Of course, you can also type summary() to get a summary of an object (note that this could be your data, your model, a vector etc.). It's often the case that what you want has a mnemonic function in R, as mentioned above. For example, to print the quantiles of a vector, simply type quantile().
Below, I provide a brief description of some popular statistical tests as well as models you can run in R. First, let's revisit the english dataset.
library(languageR)
data(english)
d = english
columns = c("RTlexdec", "Word", "AgeSubject", "LengthInLetters", "Familiarity", "Frication", "Voice", "WordCategory")
d = d[, columns]
head(d)
## # A tibble: 6 × 8
## RTlexdec Word AgeSubject LengthInLetters Familiarity Frication Voice WordC…¹
## <dbl> <fct> <fct> <int> <dbl> <fct> <fct> <fct>
## 1 6.54 doe young 3 2.37 burst voic… N
## 2 6.40 whore young 5 4.43 frication voic… N
## 3 6.30 stress young 6 5.6 frication voic… N
## 4 6.42 pork young 4 3.87 burst voic… N
## 5 6.45 plug young 4 3.93 burst voic… N
## 6 6.53 prop young 4 3.27 burst voic… N
## # … with abbreviated variable name ¹WordCategory
Statistical tests are usually not very rich in terms of the amount of information their outputs provide. You probably analyse your data using regression models instead, for example. But it’s nice to start with tests to better understand how R syntax works. Let’s briefly take a look at how you could run some tests.
If you know a test, chances are that the R function for that test has a very similar name. For example, if you want to run a t-test, you type t.test(); if you want to run a Chi-squared test of independence, you type chisq.test() (read more about it here). Let's assume you want to check if the reaction time is different between old and young speakers.
t.test(RTlexdec ~ AgeSubject, data = d)
##
## Welch Two Sample t-test
##
## data: RTlexdec by AgeSubject
## t = 67.468, df = 4534.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group old and group young is not equal to 0
## 95 percent confidence interval:
## 0.2152787 0.2281642
## sample estimates:
## mean in group old mean in group young
## 6.660958 6.439237
# Type ?t.test() to see additional arguments for the function
Manually defining groups: when we use t.test(), we normally have two arguments (groups), although you can also run a t-test on a single sample, of course. But if you have multiple groups and only want to test two of them, it is useful to create the subsets manually. The first group is the vector of reaction times that belong to young speakers. To do that, we first subset (d[d$AgeSubject == "young", ]) and then select a column, $RTlexdec (you might want to review subsetting from day 2). This may seem a little confusing, so you could do it in steps.
group1 = d[d$AgeSubject == "young", ] # or d %>% filter(AgeSubject == "young") using dplyr
group2 = d[d$AgeSubject == "old", ] # or d %>% filter(AgeSubject == "old") using dplyr
t.test(group1$RTlexdec, group2$RTlexdec)
##
## Welch Two Sample t-test
##
## data: group1$RTlexdec and group2$RTlexdec
## t = -67.468, df = 4534.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2281642 -0.2152787
## sample estimates:
## mean of x mean of y
## 6.439237 6.660958
By default, t.test() will not assume equal variances between the groups (that's why the degrees of freedom are different from what you'd calculate manually in a typical t-test). If you want to change that, simply add var.equal = TRUE. If you're running a paired t-test, add paired = TRUE to your function. Finally, if your hypothesis is directional, you should specify that by adding alternative = "less" (the default is two.sided, but you can have less or greater). In our data, we already expect older speakers to have slower reaction times.
t.test(RTlexdec ~ AgeSubject, alternative = "greater", data = d)
##
## Welch Two Sample t-test
##
## data: RTlexdec by AgeSubject
## t = 67.468, df = 4534.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group old and group young is greater than 0
## 95 percent confidence interval:
## 0.2163149 Inf
## sample estimates:
## mean in group old mean in group young
## 6.660958 6.439237
You may want to check if two variables are correlated. Below, var1 and var2 are obviously negatively correlated. var3, on the other hand, is simply simulated from a normal distribution.
var1 = seq(from = 0.5, to = 100, by = 0.5) # 200 numbers from 0.5 to 100
var2 = seq(from = 100, to = 0.5, by = -0.5) # 200 numbers from 100 to 0.5
var3 = rnorm(200, mean = 1, sd = 0.1) # 200 random numbers (normally distr)
cor.test(var1, var2)
##
## Pearson's product-moment correlation
##
## data: var1 and var2
## t = -Inf, df = 198, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -1 -1
## sample estimates:
## cor
## -1
cor.test(var1, var3)
##
## Pearson's product-moment correlation
##
## data: var1 and var3
## t = 0.93182, df = 198, p-value = 0.3526
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07333642 0.20295730
## sample estimates:
## cor
## 0.06607691
In our data frame d, you might want to see if reaction time is correlated with familiarity.
cor.test(d$RTlexdec, d$Familiarity)
##
## Pearson's product-moment correlation
##
## data: d$RTlexdec and d$Familiarity
## t = -33.522, df = 4566, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4673865 -0.4208330
## sample estimates:
## cor
## -0.4444097
ANOVAs are run using the aov() function in R. There's also another function, anova(), which is used to compute analysis of variance of one or more fitted models (see below). Let's focus on the former, which is what people normally want to use. In d, one of our variables has four levels: Frication.
str(d)
## 'data.frame': 4568 obs. of 8 variables:
## $ RTlexdec : num 6.54 6.4 6.3 6.42 6.45 ...
## $ Word : Factor w/ 2197 levels "ace","act","add",..: 467 2124 1838 1321 1302 1347 434 468 15 1632 ...
## $ AgeSubject : Factor w/ 2 levels "old","young": 2 2 2 2 2 2 2 2 2 2 ...
## $ LengthInLetters: int 3 5 6 4 4 4 4 3 3 5 ...
## $ Familiarity : num 2.37 4.43 5.6 3.87 3.93 3.27 3.73 5.67 3.1 4.43 ...
## $ Frication : Factor w/ 4 levels "burst","frication",..: 1 2 2 1 1 1 1 1 3 2 ...
## $ Voice : Factor w/ 2 levels "voiced","voiceless": 1 2 2 2 2 2 1 1 1 2 ...
## $ WordCategory : Factor w/ 2 levels "N","V": 1 1 1 1 1 1 1 1 1 1 ...
You could be interested in possible differences in reaction time across the different levels of Frication. This time, I will assign the ANOVA to a variable (you will normally do that to keep using the output of a test/model in your analysis). Perhaps before running our ANOVA, we might want to look at a plot (which is what you would normally do in a real case).
library(ggplot2)
ggplot(data = d, aes(x = Frication, y = RTlexdec)) +
geom_boxplot()
Clearly, there doesn’t seem to be any difference here. If we now run our ANOVA, we’ll confirm that.
# One-way ANOVA:
myAnova = aov(RTlexdec ~ Frication, data = d)
summary(myAnova)
## Df Sum Sq Mean Sq F value Pr(>F)
## Frication 3 0.01 0.00179 0.073 0.975
## Residuals 4564 112.45 0.02464
You could perhaps add one more factor, namely, Voice. Note, however, that this is a 2-level factor (the only factor with more than two levels in d is Frication).
# Two-way ANOVA:
myAnova2 = aov(RTlexdec ~ Frication + Voice, data = d)
summary(myAnova2)
## Df Sum Sq Mean Sq F value Pr(>F)
## Frication 3 0.01 0.00179 0.073 0.97457
## Voice 1 0.22 0.21722 8.831 0.00298 **
## Residuals 4563 112.23 0.02460
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If you wanted to run a post-hoc test such as Tukey HSD to check the different comparisons in Frication (you wouldn't in this case, of course), you can simply use the TukeyHSD() function in R. You could also plot the differences in means by simply using the plot() function (remember that R will automatically apply plot() to an object, as long as it's plottable, naturally).
TukeyHSD(myAnova2)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = RTlexdec ~ Frication + Voice, data = d)
##
## $Frication
## diff lwr upr p adj
## frication-burst 0.0022773170 -0.01136649 0.01592113 0.9735233
## long-burst 0.0041207202 -0.03986055 0.04810199 0.9950926
## short-burst 0.0008970196 -0.01504220 0.01683624 0.9989216
## long-frication 0.0018434033 -0.04224657 0.04593338 0.9995563
## short-frication -0.0013802973 -0.01761706 0.01485646 0.9963202
## short-long -0.0032237006 -0.04807714 0.04162974 0.9977649
##
## $Voice
## diff lwr upr p adj
## voiceless-voiced -0.009998164 -0.01914064 -0.0008556832 0.0320877
plot(TukeyHSD(myAnova2, "Frication"))
Read more about ANOVAs here. Also, type ?aov() to better understand the function itself.
Like t-tests, running other tests is very easy in R. If you Google x test in R, you will certainly come across numerous websites with examples. Before we start, if you want to check for normality:
shapiro.test(d$RTlexdec)
##
## Shapiro-Wilk normality test
##
## data: d$RTlexdec
## W = 0.98595, p-value < 2.2e-16
qqnorm(d$RTlexdec)
qqline(d$RTlexdec)
Be careful with misconceptions about statistical tests. Read about normality tests here.
For example, if you wanted to run a Wilcoxon test or a Kruskal-Wallis test, you'd type wilcox.test() and kruskal.test(), respectively. Read about them here and here. A comprehensive list of tests and their respective R syntax can be found here.
Simple linear regressions are run using the lm() function in R. You don't need to load any packages for that. If you want to run a hierarchical (mixed-effects) regression, then you'll need to load the lme4 package. Unfortunately, we don't have enough time in this introductory workshop to cover hierarchical models.
Let’s predict reaction time as a function of word familiarity.
lModel1 = lm(RTlexdec ~ Familiarity, data = d)
summary(lModel1)
##
## Call:
## lm(formula = RTlexdec ~ Familiarity, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.49212 -0.11285 -0.00596 0.10569 0.65072
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.780397 0.007178 944.59 <2e-16 ***
## Familiarity -0.060676 0.001810 -33.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1406 on 4566 degrees of freedom
## Multiple R-squared: 0.1975, Adjusted R-squared: 0.1973
## F-statistic: 1124 on 1 and 4566 DF, p-value: < 2.2e-16
# You can calculate confidence intervals with confint():
confint.lm(lModel1)
## 2.5 % 97.5 %
## (Intercept) 6.7663247 6.79446982
## Familiarity -0.0642243 -0.05712723
We can see that Familiarity has a significant effect on RTlexdec: \(\beta =\) -0.0607, \(p < 0.001\). Each unit of increase in Familiarity lowers RTlexdec by 0.0607. We can replot the data to check that:
ggplot(data = d, aes(x = Familiarity, y = RTlexdec)) +
geom_point(alpha = 0.2) +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
The negative effect we find in our model corresponds to the negative trend line in the plot above. Let's add AgeSubject to our model.
lModel2 = lm(RTlexdec ~ Familiarity + AgeSubject, data = d)
summary(lModel2)
##
## Call:
## lm(formula = RTlexdec ~ Familiarity + AgeSubject, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.38126 -0.05907 -0.00418 0.05134 0.53986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.891258 0.004595 1499.82 <2e-16 ***
## Familiarity -0.060676 0.001113 -54.52 <2e-16 ***
## AgeSubjectyoung -0.221721 0.002558 -86.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08643 on 4565 degrees of freedom
## Multiple R-squared: 0.6967, Adjusted R-squared: 0.6966
## F-statistic: 5244 on 2 and 4565 DF, p-value: < 2.2e-16
This is clearly a better model (with a much higher \(R^2\)). Given the clear effect of AgeSubject, we wouldn't want to leave this predictor out. We can see that lModel2 is significantly better than lModel1 by using the anova() function.
anova(lModel2, lModel1)
## # A tibble: 2 × 6
## Res.Df RSS Df `Sum of Sq` F `Pr(>F)`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4565 34.1 NA NA NA NA
## 2 4566 90.2 -1 -56.1 7515. 0
If you want to check the residuals of your model:
plot(residuals(lModel2), col = "gray")
There are several ways to plot a fitted model in R. One nice way is to use the sjPlot package, which is based on ggplot2. As a result, if you're familiar with the aesthetics of ggplot2, sjPlot will look familiar.
library(sjPlot) # Load package
## Install package "strengejacke" from GitHub (`devtools::install_github("strengejacke/strengejacke")`) to load all sj-packages at once!
plot_model(lModel2)
Because the standard errors are very small, and because the estimates are relatively far from each other on the x-axis, you can’t actually see error bars—but they’re there.
You can adjust several parameters in a sjPlot plot. Spend some time playing around with this package: you can use it for linear and logistic regression (including hierarchical/mixed-effects models). Read more here.
To run logistic regressions, you'll use glm() in R. Let's predict age group by reaction time. Recall that logistic regressions provide estimates in log-odds.
glmModel1 = glm(AgeSubject ~ RTlexdec, data = d, family = "binomial")
summary(glmModel1)
##
## Call:
## glm(formula = AgeSubject ~ RTlexdec, family = "binomial", data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3951 -0.5312 0.0235 0.4229 3.6178
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 128.6308 3.8008 33.84 <2e-16 ***
## RTlexdec -19.6497 0.5804 -33.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6332.6 on 4567 degrees of freedom
## Residual deviance: 3145.8 on 4566 degrees of freedom
## AIC: 3149.8
##
## Number of Fisher Scoring iterations: 6
Let's briefly walk through this:
AgeSubject has two levels: old and young
R orders factor levels alphabetically, so old is the reference level
Internally, old is coded as 0 and young as 1
The model therefore predicts the probability of being young
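You can verify this coding yourself (a quick sanity check using base R):
levels(d$AgeSubject) # "old" "young": old comes first, so it's the reference level
contrasts(d$AgeSubject) # Shows that young is coded as 1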
Our model can be expressed as \(log(p/(1-p)) = 128.63 + (-19.64*RT)\). The intercept tells us the log-odds of being in the young group when RTlexdec is zero; since RTlexdec is log-transformed, that corresponds to a raw reaction time of 1 ms (because exp(0) = 1). This is not a meaningful interpretation here given the scale used, so let's ignore the intercept and focus on what you probably want to know.
Let's assume a reaction time of 6 (again: remember that this is log-transformed in the english data set). In that case, the model predicts that the log-odds of being in the young group are \(128.63 + (-19.64 * 6) = 10.79\). Now let's assume a reaction time of 7: \(128.63 + (-19.64 * 7) = -8.85\). The difference between these two predictions is \(-8.85 - (+10.79) = -19.64\), which is exactly the estimate for RTlexdec in the model above. As a result, we can say that -19.64 is the difference in log-odds of p(AgeSubject == young) for a unit increase in the predictor RTlexdec. In terms of odds, a unit increase multiplies the odds of being young by \(exp(-19.64) \approx 3 \times 10^{-9}\); equivalently, the odds are divided by \(exp(19.64) \approx 3.4 \times 10^{8}\).
This means that if your reaction time increases by 1 unit, the probability of being young drops by nearly 100% (since \(3.4 \times 10^{8}/(1 + 3.4 \times 10^{8}) \approx 1\)). This seems quite radical because the unit increase is too large: the range of reaction time values we have is [6.205325, 7.187808]. As a result, by increasing one unit (from 6 to 7) we're visiting the extremes of the range (recall that RTlexdec is on the log scale). Naturally, the interpretation of the results depends on the units in question. Sometimes it will make sense to work with smaller or larger units if you want to report the actual interpretation.
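To sanity-check these numbers in R (coefficients rounded from the model output above, so the results are approximate):
logit6 = 128.63 + (-19.64 * 6) # Log-odds of young at RTlexdec = 6
logit7 = 128.63 + (-19.64 * 7) # Log-odds of young at RTlexdec = 7
plogis(logit6) # Probability of young: essentially 1
plogis(logit7) # Probability of young: essentially 0
exp(logit6 - logit7) # Odds ratio for a 1-unit increase: about 3.4e8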
Just so you can see why this makes sense, let’s replot reaction time by subject age.
ggplot(data = d, aes(x = AgeSubject, y = RTlexdec)) +
geom_boxplot()
You could also backtransform your data. Let’s create a new column that contains the original reaction time values (in ms). Then let’s check the range of RT values we observe in the data.
d$originalRT = exp(d$RTlexdec)
range(d$originalRT)
## [1] 495.38 1323.20
Note that now your units are in ms, which may not be very good for interpreting estimates either, because 1 ms is too small to matter. Let's say you want to know how much the probability of being young changes for every 100 ms. This would be more meaningful.
d$RT100 = d$originalRT / 100
glmModel2 = glm(AgeSubject ~ RT100, data = d, family = "binomial")
summary(glmModel2)
##
## Call:
## glm(formula = AgeSubject ~ RT100, family = "binomial", data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3244 -0.5309 0.0420 0.4471 3.8918
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 19.37245 0.57048 33.96 <2e-16 ***
## RT100 -2.77268 0.08162 -33.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6332.6 on 4567 degrees of freedom
## Residual deviance: 3199.7 on 4566 degrees of freedom
## AIC: 3203.7
##
## Number of Fisher Scoring iterations: 6
By dividing originalRT by 100, each new unit represents 100 ms. Our new estimate for RT is -2.77 (in log-odds). The negative sign shows the direction of the effect: longer RTs lower the probability of being young. In other words, this means that by adding 100 ms to the RT, the odds of being young go down by a factor of exp(2.77) = 15.95. Let's imagine three people: one with a reaction time of 500 ms, another with 600 ms, and another with 1000 ms. Then we can ask what the probabilities of being young are for these people based on our model. Note that the difference between person 1 and person 2 is equivalent to one unit of RT100.
person1 = 19.37 + (-2.77*5) # log-odds for person 1
person2 = 19.37 + (-2.77*6)
person3 = 19.37 + (-2.77*10)
person2-person1 # difference in log-odds (=estimate)
## [1] -2.77
exp(person1) # person 1 in odds
## [1] 249.635
exp(person1)/(1+exp(person1)) # person 1 in probabilities
## [1] 0.9960101
exp(person2)/(1+exp(person2)) # person 2 in probabilities
## [1] 0.9399133
exp(person3)/(1+exp(person3)) # person 3 in probabilities
## [1] 0.0002416569
You don't have to do this manually, of course. Let's imagine you have new speakers with associated RTs. You want to know the changes in probability of being old/young as a function of their RTs. Let's assume that our model is a good model and let's use it. Here's what we are going to do: first, create a data frame with the new reaction times (in units of RT100); then, use predict() to get the model's predictions for each value of RT100; finally, transform the predicted log-odds into probabilities and plot them.
newRTs = data.frame(
RT100 = seq(from = 5, to = 10, by = 0.5) # From 500ms to 1000ms by 50ms
)
preds = predict(glmModel2, newRTs)
probPreds = exp(preds)/(1+exp(preds))
plot(probPreds, type = "b",
ylab = "Probability of being young",
xlab = "Reaction time (1 unit = 100ms)")
Note that above we've transformed log-odds into probabilities manually. You don't need to do that, though. Simply add type = "response" to the predict() function:
newRTs = data.frame(
RT100 = seq(from = 5, to = 10, by = 0.5) # From 500ms to 1000ms by 50ms
)
preds2 = predict(glmModel2, newRTs, type = "response")
plot(preds2, type = "b",
ylab = "Probability of being young",
xlab = "Reaction time (1 unit = 100ms)")
This is a good tutorial on how to interpret logistic regression results.
Even though this tutorial does not cover tidying up your data, that’s normally the very first thing you’ll have to do after you import your file into R. There are several important aspects involved in organizing your data frame, but let’s just focus on one aspect (one which I think is crucial). If you’ve used websites to collect data (e.g., SurveyGizmo), you probably know that the output is messy. You normally have one participant per row, which means you’ll have a bunch of columns storing the same kind of data. So if you have 100 questions of the same type, you won’t have a single column storing 100 values (the long format). Instead, you’ll have 100 columns (i.e., the wide format). Needless to say, this is not what you want.
First, let's see what is meant by a wide data frame. Here's an example (let's call this data frame wide1).
## # A tibble: 3 × 6
## participant origin age response_1 response_2 response_3
## <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 subject1 USA 24 4.03 5.24 10.3
## 2 subject2 Portugal 32 3.93 5.35 9.90
## 3 subject3 Canada 41 4.91 3.68 8.79
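The code that builds wide1 isn't shown above, so here's a minimal sketch that produces an equivalent data frame (values copied from the rounded printout, so they're approximate):
wide1 = data.frame(
participant = factor(paste0("subject", 1:3)),
origin = factor(c("USA", "Portugal", "Canada")),
age = c(24, 32, 41),
response_1 = c(4.03, 3.93, 4.91),
response_2 = c(5.24, 5.35, 3.68),
response_3 = c(10.3, 9.90, 8.79)
)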
You can't possibly model a response variable that spans multiple columns. So let's go from wide to long using the gather() function in the tidyr package.
long1 = wide1 %>%
gather(key = stimulus, value = RT,
response_1:response_3) %>%
arrange(participant)
long1
## # A tibble: 9 × 5
## participant origin age stimulus RT
## <fct> <fct> <dbl> <chr> <dbl>
## 1 subject1 USA 24 response_1 4.03
## 2 subject1 USA 24 response_2 5.24
## 3 subject1 USA 24 response_3 10.3
## 4 subject2 Portugal 32 response_1 3.93
## 5 subject2 Portugal 32 response_2 5.35
## 6 subject2 Portugal 32 response_3 9.90
## 7 subject3 Canada 41 response_1 4.91
## 8 subject3 Canada 41 response_2 3.68
## 9 subject3 Canada 41 response_3 8.79
Inside gather(), you need to specify the key (the name of the new column that will hold the names of the columns you're gathering) and the value, i.e., the column that will hold the actual RT values. Naturally, you could adjust the actual content of the column stimulus (e.g., remove the response_ bits and keep only the number), as sketched below.
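For example, here's one way to do that adjustment, using gsub() and dplyr's mutate():
long1 %>%
mutate(stimulus = gsub("response_", "", stimulus)) # "response_1" becomes "1", etc.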
Now let's go from long back to wide using the function spread().
wide_again = long1 %>%
spread(key = stimulus, value = RT)
wide_again
## # A tibble: 3 × 6
## participant origin age response_1 response_2 response_3
## <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 subject1 USA 24 4.03 5.24 10.3
## 2 subject2 Portugal 32 3.93 5.35 9.90
## 3 subject3 Canada 41 4.91 3.68 8.79
What if you have two sets of multiple columns that you need to gather? For example, suppose each of your questions is accompanied by a certainty scale.
## # A tibble: 3 × 9
## participant origin age response_1 certai…¹ respo…² certa…³ respo…⁴ certa…⁵
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 subject1 USA 24 3.15 3 4.10 2 9.86 5
## 2 subject2 Portugal 32 4.35 4 6.35 5 9.49 4
## 3 subject3 Canada 41 3.24 4 6.80 1 8.59 3
## # … with abbreviated variable names ¹certainty_1, ²response_2, ³certainty_2,
## # ⁴response_3, ⁵certainty_3
This is a trickier case, and there are different ways to do it. I'll do it in three steps. First, I'll ignore all columns that contain certainty, and gather() the response columns, just like what was done above. That will be one data frame. Second, I'll do the same thing, but this time focusing on certainty. Third, I'll take the result of step 1 and add a new column to it. This column (certainty) will come from the result of step 2. I'll also arrange the data frame by participant to make it easier to check whether the output is correct given the original wide data frame.
# Step 1
long2a = wide2 %>%
select(-contains("certainty")) %>%
gather(key = stimulus, value = RT, response_1:response_3, factor_key = TRUE)
# Step 2
long2b = wide2 %>%
select(-contains("response")) %>%
gather(key = stimulus, value = certainty, certainty_1:certainty_3, factor_key = TRUE)
# Step 3
long2 = long2a %>%
mutate(certainty = long2b$certainty) %>%
arrange(participant)
long2
## # A tibble: 9 × 6
## participant origin age stimulus RT certainty
## <fct> <fct> <dbl> <fct> <dbl> <dbl>
## 1 subject1 USA 24 response_1 3.15 3
## 2 subject1 USA 24 response_2 4.10 2
## 3 subject1 USA 24 response_3 9.86 5
## 4 subject2 Portugal 32 response_1 4.35 4
## 5 subject2 Portugal 32 response_2 6.35 5
## 6 subject2 Portugal 32 response_3 9.49 4
## 7 subject3 Canada 41 response_1 3.24 4
## 8 subject3 Canada 41 response_2 6.80 1
## 9 subject3 Canada 41 response_3 8.59 3
So all you need to know are gather() and spread(). (In more recent versions of tidyr, these two functions have been superseded by pivot_longer() and pivot_wider(), but they still work.) Other packages also offer similar tools (though they may be less intuitive). Examples include reshape2 and data.table.
Functions are naturally useful pieces of code for obvious reasons. You may want to perform a given sequence of steps in R, and writing a function that does that is a way to automate the whole process. Functions have a similar structure in different languages, and in R you’ll need curly brackets. Let’s suppose you want to create a function that takes a number and returns that same number minus its square root. Let’s define and apply the function.
myFunction = function(x){
return(x-sqrt(x))
}
myFunction(9)
## [1] 6
myFunction will take your input, the variable x, and perform a given process. Inside the function, you could have any number of things, including additional functions (i.e., nested/embedded functions). You can also personalize your function's output, create error messages etc. For example, in this case your input has to be a number. If it's not, myFunction won't be able to return an output. You can also print a more interesting result, embedding the output in a sentence.
myFunction = function(x){
output = paste("Your result is", x-sqrt(x))
return(output)
}
myFunction(9)
## [1] "Your result is 6"
Now, we use the function paste() to concatenate variables and strings, store the result in a variable, and return that variable. This makes your function more 'user-friendly', so to speak. Any function you use in R has the same essential structure, the only differences being the number of arguments you need to pass to the function (here, only 1) and, obviously, the complexity of the processes inside the function.
IF-statements are an important kind of control structure: they logically evaluate a given input (TRUE or FALSE) and perform some action(s) based on that. You can also add an else component, as well as intermediate steps with else if. Here's an example: assume we have two variables, x and y. In what follows, we first evaluate whether these variables have the same value (they don't). If they don't, then we evaluate whether x is smaller than y (it isn't). Finally, it must be the case that x is greater than y, so we print x is greater than y. Note that the curly brackets' position ensures that the whole expression is interpreted as a single unit (it also helps to keep the statements organized).
x = 10 # Let's define two variables first
y = 5
if(x == y){
print("x and y are the same")
} else if(x < y){
print("x is smaller than y")
} else{
print("x is greater than y")
}
## [1] "x is greater than y"
Let's create a function that contains an IF-statement in it. Let's assume you want a function that tells you whether a number is even or odd. In other words, you give this function a numeric input and it returns one of two sentences, namely, 'your number is even' or 'your number is odd'. To do this, we'll tell the function to divide our input by 2. If the remainder is 0, the number is even; otherwise it's odd. We'll do this by using the modulo operator: %%. Let's call this function myNumber(). I'll define the function first, and then use it with three examples.
myNumber = function(number){
if(number %% 2 == 0){
return('Your number is even')
} else {
return('Your number is odd')
}
}
# Let's use it with 5 now.
myNumber(5)
## [1] "Your number is odd"
# And with 3870
myNumber(3870)
## [1] "Your number is even"
# And with some expression
myNumber(9478 + 37 * 2)
## [1] "Your number is even"
IF-statements are incredibly useful when you need to reshape your data. For example, you may need to create a new column with values that depend on other columns (see the sketch right after this paragraph). One alternative to IF-statements is the ifelse() function, which can also be embedded (so you can nest ifelses in one single ifelse() function). Thus, if you know how to create new columns (we saw this earlier) and you know how to handle conditionals, you're ready to make use of both.
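Here's a minimal sketch with the english data (the cut-off value of 6.55 is arbitrary, chosen just for illustration):
d$speed = ifelse(d$RTlexdec > 6.55, "slow", "fast") # New column conditional on RTlexdec
head(d$speed)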
Finally, let me show you one more example of ifelse().
myNumber2 = function(number){
ifelse(number %% 2 == 0, 'Even', 'Odd') # ifelse() returns its value, so explicit return() calls aren't needed
}
myNumber2(3)
## [1] "Odd"
If you need something simple, ifelse() functions are quite useful. Even if you need one intermediate step, it's still OK. For example, ifelse(x, y, ifelse(z, w, a)) tells you that if x is TRUE, then do y. If x is not TRUE, then evaluate the following: if z is TRUE, then do w; otherwise, do a. But you can see that as your level of embedding gets more and more complicated, ifelse() might not be the best way to go.
Let's say you want to perform a sequence of actions for every row in your data frame. Each action may be easy, but doing it by hand for all 1500 rows in your dataset would be very time-consuming. One way to automate the whole process is to use a loop. This is not the best way to automate actions if you have a LOT of data (loops can be quite slow), but if you only have a relatively small dataset it's a good way to make things work. For now, we won't be doing that kind of loop. Instead, let's just work with a simple example: assume you want to create a sequence of 50 numbers. This can be easily done with seq(50) in R, so you wouldn't want a for-loop for that. But let's say you want the sequence to follow a specific pattern, namely, each number (from 1 to 50) in the sequence should be added to itself squared. So 1 would become 2 (1 + 1²); 2 would become 6 (2 + 2²), and so forth. This is what the following loop does. See comments.
outcome = c() # First, we create an empty vector to store each number in the sequence
for(i in 1:50){
outcome[length(outcome)+1] = i + i^2 # This adds the number to the empty vector
}
outcome # Let's print the vector and see if the numbers are there
## [1] 2 6 12 20 30 42 56 72 90 110 132 156 182 210 240
## [16] 272 306 342 380 420 462 506 552 600 650 702 756 812 870 930
## [31] 992 1056 1122 1190 1260 1332 1406 1482 1560 1640 1722 1806 1892 1980 2070
## [46] 2162 2256 2352 2450 2550
If you're not familiar with coding: everything inside the for-loop will be done to each number in the sequence. So for each number i in the sequence from 1 to 50, we add i + i^2 to outcome. How? Simple: look at the length of outcome and add 1. When it's empty, its length is zero, so the first number will be added to position 1 (0 + 1). The second time the for-loop runs, the length of outcome will be 1, so the second number generated will be added to position 2 (length of outcome + 1). It's easy, practical and very powerful, since you can do very intricate things with loops.
Note that outcome is an empty vector at first (so its length is 0). As a result, we have to take that into account inside the loop. However, it's faster if we tell R up front how many 'slots' we'll be using (the code is also more readable, given that now we have the indices from the start).
outcome = vector("double", length = 50)
# This creates a vector of length 50. double is simply a type of value (a number).
for(i in seq_along(outcome)){
outcome[i] = i + i^2
}
# seq_along(x) is the same as 1:length(x). However, if x is empty, seq_along() will return the correct output (which is empty), whereas 1:length(x) will count back from 1 to 0, returning "1 0" instead.
While-loops are also quite important, and follow the same pattern as for-loops, except that you perform a sequence of actions while a given condition is met. You can read more about loops in general here. The idea here is to introduce you to some basic concepts so you'll be able to navigate the thousands of websites on the topic.
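Here's a minimal while-loop sketch (the threshold of 100 is arbitrary): we keep doubling a number while it is at most 100.
x = 1
while(x <= 100){ # Repeat the body as long as this condition is TRUE
x = x * 2
}
x # 128: the first power of 2 above 100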
You probably remember plotting quadratic equations when you were in high school. You can naturally do that in R as well: there are some pre-defined functions (distributions) that come with R, but you can also create your own function and use it in a plot. We do that by using stat_function(fun = ?) in our ggplot. Simply replace ? with whatever function you'd like. For example, let's say you have the function \(4x^2 - 3x + 9\), and you want to plot it assuming x goes from -5 to +5.
library(ggplot2)
myFunction = function(x){4 * x ^ 2 - 3 * x + 9}
ggplot(data=data.frame(x=seq(-5,5)), aes(x)) +
stat_function(fun=myFunction) +
theme_bw() + xlab('x-axis') + ylab('y-axis')
You don't need to define a function before plotting it. Instead, you could do both things at the same time (though it might look a little messier). Now, let's use a pre-defined function in R: exp. This time, let's have a dashed line.
ggplot(data=data.frame(x=seq(-5,5)), aes(x)) +
stat_function(fun=exp, linetype='dashed') +
theme_bw() + xlab('x-axis') + ylab('y-axis')
It’s often useful to create plots that show distributions, so I’ve created a simple script that does that. You can check it out here. For example, you might want to create a simple plot to show what a normal distribution is and how it changes depending on the standard deviations. Here’s an example.
ggplot(data.frame(j=c(-20,20)), aes(x=j)) +
stat_function(fun=function(x){dnorm(x,mean=0,sd=1)}, aes(color='A')) +
stat_function(fun=function(x){dnorm(x,mean=0,sd=2)}, aes(color='B')) +
stat_function(fun=function(x){dnorm(x,mean=0,sd=3)}, aes(color='C')) +
stat_function(fun=function(x){dnorm(x,mean=0,sd=4)}, aes(color='D')) +
stat_function(fun=function(x){dnorm(x,mean=0,sd=5)}, aes(color='E')) +
theme_bw() + theme(text=element_text(size=10, vjust=1)) +
ggtitle("Normal Distribution") + ylab(NULL) + xlab(NULL) +
scale_color_manual('SD', values = c('black', 'red', 'blue', 'grey', 'brown'), labels=c('1', '2', '3', '4', '5'))
The code looks a bit messy, but I basically added as many stat_function() layers as necessary, so that multiple distributions can be plotted within the same ggplot(). This looks obvious, but being able to stack different plots onto a single 'canvas' is a very important feature of ggplot2.
Nested data frames can be very useful to keep your data organised in a single data frame. The idea is this: imagine if a cell in your data frame is a list instead of a single value. Lists, as you know, are very powerful, so they can hold anything. As a result, a cell could contain an entire data frame, which is “hidden” until you decide to explore it.
Let's go through a simple example with the english data set again. Suppose I want to separate the speakers in the data, and separate the data by Frication. Then, I want to run models that predict RTlexdec based on Familiarity, and then I want to compare the \(R^2\) values for all these models, to spot some trends. This may not be a model you wish to actually work with, but as a step in your analysis it may be very useful, in particular when you have a lot of data to look at.
First, we have to create a column for speakers, since english has no such column. Given the description of english in the languageR documentation, I assume there are two speakers (but it doesn't matter: let's just simulate them). Each unique word in english is duplicated, so we have pairs of words. Let's arrange the data by word, and then add sequences of speaker_1 and speaker_2 to a new column, Speaker.
# I'll reload the packages, just to make it more explicit
library(languageR)
library(dplyr)
library(tidyr) # This will be used to nest data frames
library(broom) # This will be used to explore model structures
library(purrr) # This will be used instead of for-loops, basically
library(ggplot2)
data(english) # Let's reload "english", just in case
# Make english a tibble so it's easier to visualize it
english = as_tibble(english) # as_tibble() replaces the deprecated as_data_frame()
# Add speaker column (make sure you follow what's being done below)
eng = english %>%
arrange(Word) %>%
mutate(Speaker = as.factor(rep(c("speaker_1", "speaker_2"), times = nrow(english)/2)))
Now we have a new column, Speaker. Let's group everything by Speaker and Frication and hide everything else. You'll see how clean the output is.
# Nest data frame
byFrication = eng %>%
group_by(Speaker, Frication) %>%
nest() %>%
arrange(Speaker)
# Now, take a look at what our data look like
byFrication
## # A tibble: 8 × 3
## Frication Speaker data
## <fct> <fct> <list>
## 1 long speaker_1 <tibble [44 × 35]>
## 2 short speaker_1 <tibble [490 × 35]>
## 3 burst speaker_1 <tibble [920 × 35]>
## 4 frication speaker_1 <tibble [830 × 35]>
## 5 long speaker_2 <tibble [44 × 35]>
## 6 short speaker_2 <tibble [490 × 35]>
## 7 burst speaker_2 <tibble [920 × 35]>
## 8 frication speaker_2 <tibble [830 × 35]>
# The column "data" contains several tibbles, which can be accessed the same way you access items in a list.
# Accessing first nested data:
byFrication$data[[1]] # this will be speaker 1, Frication == long
## # A tibble: 44 × 35
## RTlex…¹ RTnam…² Famil…³ Word AgeSu…⁴ WordC…⁵ Writt…⁶ Writt…⁷ Famil…⁸ Deriv…⁹
## <dbl> <dbl> <dbl> <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 6.44 6.12 3.8 ace young N 4.22 -1.25 0.693 0
## 2 6.38 6.14 5.53 age young N 8.40 0.971 5.05 3.97
## 3 6.29 6.14 4.37 aid young V 6.93 1.91 1.61 0.0188
## 4 6.54 6.14 3.27 aide young N 4.62 2.64 0.693 0
## 5 6.62 6.14 2.3 ail young V 4.92 3.64 1.10 0.856
## 6 6.38 6.11 4.1 aim young V 6.60 2.10 1.79 0.327
## 7 6.24 6.11 5.53 air young N 8.41 1.51 4.14 2.33
## 8 6.50 6.23 4.1 aisle young N 4.87 1.28 0.693 0
## 9 6.46 6.12 3.23 ale young N 4.04 0.364 1.10 0.198
## 10 6.37 6.11 3.33 ape young N 4.51 1.44 1.79 0.0421
## # … with 34 more rows, 25 more variables: InflectionalEntropy <dbl>,
## # NumberSimplexSynsets <dbl>, NumberComplexSynsets <dbl>,
## # LengthInLetters <int>, Ncount <int>, MeanBigramFrequency <dbl>,
## # FrequencyInitialDiphone <dbl>, ConspelV <int>, ConspelN <dbl>,
## # ConphonV <int>, ConphonN <dbl>, ConfriendsV <int>, ConfriendsN <dbl>,
## # ConffV <dbl>, ConffN <dbl>, ConfbV <dbl>, ConfbN <dbl>,
## # NounFrequency <int>, VerbFrequency <int>, CV <fct>, Obstruent <fct>, …
Now, let’s run different linear models and extract their \(R^2\).
# Let's create a column that will hold all the models
byFrication = byFrication %>%
mutate(model = map(data, ~ lm(RTlexdec ~ Familiarity, data = .)))
# The map() function applies the lm() function to all items in byFrication$data
# Now, let's extract the R2 and add them to a new column
# Below, we first create a column with info about the model, from which we will derive the R2 values
byFrication = byFrication %>%
mutate(
modelInfo = model %>% map(broom::glance),
R2 = modelInfo %>% map_dbl("r.squared")
)
# If you now take a look at the tibble, you'll see our new columns (note that tibbles can omit certain columns; take a look at the bottom of the output to see which variables are not shown):
byFrication
## # A tibble: 8 × 6
## Frication Speaker data model modelInfo R2
## <fct> <fct> <list> <list> <list> <dbl>
## 1 long speaker_1 <tibble [44 × 35]> <lm> <tibble [1 × 12]> 0.560
## 2 short speaker_1 <tibble [490 × 35]> <lm> <tibble [1 × 12]> 0.462
## 3 burst speaker_1 <tibble [920 × 35]> <lm> <tibble [1 × 12]> 0.444
## 4 frication speaker_1 <tibble [830 × 35]> <lm> <tibble [1 × 12]> 0.431
## 5 long speaker_2 <tibble [44 × 35]> <lm> <tibble [1 × 12]> 0.425
## 6 short speaker_2 <tibble [490 × 35]> <lm> <tibble [1 × 12]> 0.347
## 7 burst speaker_2 <tibble [920 × 35]> <lm> <tibble [1 × 12]> 0.362
## 8 frication speaker_2 <tibble [830 × 35]> <lm> <tibble [1 × 12]> 0.344
# Finally, we can plot the R2 and see that for both speakers, the best models (re. R2) are the ones where Frication == long. Also, speaker 1 has a higher R2 across all types of Frication.
byFrication %>% ggplot(aes(x = R2, y = reorder(Frication, R2))) +
geom_point(aes(color = Speaker), size = 4) +
labs(y = "Frication", x = expression(R^2))
# Note that I have reordered the levels of Frication based on R2 values *in the plot*.
Read more about nested data frames here.
Sometimes, you have multiple files (data frames) to analyse. For example, you may have a spreadsheet with information about your stimuli and another spreadsheet with the actual output of the experiment. You normally want to extract the results from the second data frame and paste them into your first data frame. However, sometimes the row order is different, so you not only need to copy and paste, you also have to match the rows correctly. In Excel you would use LOOKUP() or VLOOKUP(), for example.
As an example, let's simulate two data frames: df1 and df2. Let's say you want to add the origin and the age of each speaker to df1, but these two columns are in data frame df2. We'll first use match(), and then do the same thing using dplyr. Let's look at each data frame.
library(dplyr)
set.seed(12)
df1 = data.frame(
speaker = rep(paste("speaker_", seq(1:3), sep = ""), each = 3),
RT = rnorm(9, mean = 0.5, sd = 0.05),
verb = rep(c("go", "travel", "make"), 3),
response = sample(c("correct", "incorrect"), size = 9, replace = TRUE)
)
df1
## # A tibble: 9 × 4
## speaker RT verb response
## <chr> <dbl> <chr> <chr>
## 1 speaker_1 0.426 go incorrect
## 2 speaker_1 0.579 travel incorrect
## 3 speaker_1 0.452 make correct
## 4 speaker_2 0.454 go incorrect
## 5 speaker_2 0.400 travel correct
## 6 speaker_2 0.486 make incorrect
## 7 speaker_3 0.484 go incorrect
## 8 speaker_3 0.469 travel incorrect
## 9 speaker_3 0.495 make correct
df2 = data.frame(
speaker = rep(paste("speaker_", seq(1:3), sep = ""), each = 1),
origin = rep(c("Canada", "US", "Italy"), each = 1),
age = rep(c(25, 37, 18), each = 1)
)
df2
## # A tibble: 3 × 3
## speaker origin age
## <chr> <chr> <dbl>
## 1 speaker_1 Canada 25
## 2 speaker_2 US 37
## 3 speaker_3 Italy 18
Naturally, these two data frames are equally arranged, which quite often is not the case. To make things more realistic, let's order df1 by RT, so that the data frames are not ordered the same way.
df1 = df1 %>% arrange(RT)
copy_df1 = df1 # Create copy of df1 (see below)
df1
## # A tibble: 9 × 4
## speaker RT verb response
## <chr> <dbl> <chr> <chr>
## 1 speaker_2 0.400 travel correct
## 2 speaker_1 0.426 go incorrect
## 3 speaker_1 0.452 make correct
## 4 speaker_2 0.454 go incorrect
## 5 speaker_3 0.469 travel incorrect
## 6 speaker_3 0.484 go incorrect
## 7 speaker_2 0.486 make incorrect
## 8 speaker_3 0.495 make correct
## 9 speaker_1 0.579 travel incorrect
Now, to add origin and age to df1, you can use the match() function:
# Adding the columns in df2 to df1 using match():
df1$origin = df2$origin[match(df1$speaker, df2$speaker)]
df1$age = df2$age[match(df1$speaker, df2$speaker)]
df1
## # A tibble: 9 × 6
## speaker RT verb response origin age
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 speaker_2 0.400 travel correct US 37
## 2 speaker_1 0.426 go incorrect Canada 25
## 3 speaker_1 0.452 make correct Canada 25
## 4 speaker_2 0.454 go incorrect US 37
## 5 speaker_3 0.469 travel incorrect Italy 18
## 6 speaker_3 0.484 go incorrect Italy 18
## 7 speaker_2 0.486 make incorrect US 37
## 8 speaker_3 0.495 make correct Italy 18
## 9 speaker_1 0.579 travel incorrect Canada 25
Now using dplyr: simply call left_join(). Let’s add the columns in df2 to copy_df1.
new_df1 = copy_df1 %>% left_join(df2, by = "speaker")
new_df1
## # A tibble: 9 × 6
## speaker RT verb response origin age
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 speaker_2 0.400 travel correct US 37
## 2 speaker_1 0.426 go incorrect Canada 25
## 3 speaker_1 0.452 make correct Canada 25
## 4 speaker_2 0.454 go incorrect US 37
## 5 speaker_3 0.469 travel incorrect Italy 18
## 6 speaker_3 0.484 go incorrect Italy 18
## 7 speaker_2 0.486 make incorrect US 37
## 8 speaker_3 0.495 make correct Italy 18
## 9 speaker_1 0.579 travel incorrect Canada 25
Clearly, the dplyr option is easier to read. You can also use the data.table package, which offers a very flexible syntax.
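For completeness, here’s a minimal sketch of the same join using data.table (not part of the original tutorial; the on argument names the key column):
library(data.table)
dt1 = as.data.table(copy_df1)
dt2 = as.data.table(df2)
dt2[dt1, on = "speaker"] # X[Y] is a join in data.table: adds dt2's columns to dt1's rows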
Manipulating strings can be very useful when we analyse linguistic data. Naturally, there are many tools for text mining (see resources below), which can be great for creating corpora from text, for example. In fact, the tm package offers several interesting functions that go beyond text mining, and which can certainly help you ‘clean’ your strings (e.g., getting rid of function words). Here, I will focus on string manipulation, which is perhaps more commonly used in linguistic data analysis among experimental/theoretical linguists.
In particular, I will focus on three things: fundamental functions for exploring strings, bigrams, and regular expressions for pattern recognition.
Let’s get started with two objects (variables): a sentence and a word. They share the same class (character).
sentence = "Capybaras are very friendly"
word = "capybara"
(Image caption: capybaras don’t normally look stressed.)
Let’s first focus on the variable sentence and check some basic functions.
toupper(sentence)
## [1] "CAPYBARAS ARE VERY FRIENDLY"
tolower(sentence)
## [1] "capybaras are very friendly"
Because most of the time you want to standardise your words, tolower() may be quite useful.
Note that sentence contains a single string, which is a sentence (so you can’t easily access the individual words). As a result, you will often want to extract each word from Capybaras are very friendly.
sentenceWords = strsplit(tolower(sentence), split = " ")[[1]]
sentenceWords
## [1] "capybaras" "are" "very" "friendly"
The function strsplit() above has a very important argument (split), which tells R what delimits the elements we want to extract. Here, because we want words, we simply tell R to split on a space, " ". The second important thing here is [[1]]. Here’s what it does: strsplit() outputs a list, which in this case has only one entry; inside that entry we find the individual words. If you don’t use [[1]] above, you’ll need to type sentenceWords[[1]][1] to access the first word in the sentence. By including [[1]] above, you can simply type sentenceWords[1], because sentenceWords is already the character vector stored in that first entry.
If you’re familiar with Python, you will notice that accessing the last item in a vector cannot be achieved by using -1 (this will in fact remove item 1; see below). A second difference is that we start counting from 1 in R.
sentenceWords[2] # Second word in the sentence
## [1] "are"
sentenceWords[length(sentenceWords)] # Last word
## [1] "friendly"
sentenceWords[-1] # This will remove the first item
## [1] "are" "very" "friendly"
sentenceWords[c(4,1)] # This will pick items 4 and 1
## [1] "friendly" "capybaras"
R offers some functions that concatenate variables and strings (e.g., cat(), paste()). For example:
paste("The variable 'sentenceWords' has", length(sentenceWords), "words")
## [1] "The variable 'sentenceWords' has 4 words"
paste("Some", sentenceWords[1], "make great pets.")
## [1] "Some capybaras make great pets."
paste("a", "b", sep = "-") # You can also add a separator
## [1] "a-b"
What if we want to know the length of each word in a given vector?
library(stringr)
for(i in 1:length(sentenceWords)){
print(
paste(
sentenceWords[i], ": ", str_length(sentenceWords[i]), "letters"
)
)
}
## [1] "capybaras : 9 letters"
## [1] "are : 3 letters"
## [1] "very : 4 letters"
## [1] "friendly : 8 letters"
# Alternatively (and more simply):
str_length(sentenceWords)
## [1] 9 3 4 8
I used str_length() from the stringr package above, but you could also use nchar(), which requires no packages. Read more about stringr here.
What if you want to add a word to sentenceWords? In fact, what if you want to add any item to a vector…?
# Let's add another word to our sentence (vector)
sentenceWords[length(sentenceWords)+1] = "animals"
sentenceWords
## [1] "capybaras" "are" "very" "friendly" "animals"
How about our variable word, defined above? We could also split it into characters:
wordLetters = strsplit(word, split = "")[[1]]
wordLetters
## [1] "c" "a" "p" "y" "b" "a" "r" "a"
You simply need to tell R to split by… nothing: "". Also, note that R actually knows the letters in the alphabet:
letters # All letters
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
sample(letters, 5) # Let's randomly sample 5 letters
## [1] "h" "k" "n" "g" "l"
n-grams can be very useful. As an example, let’s see how you could create your own function that returns the bigrams of a given word (this could naturally be extended to bigrams of a given sentence).
wdBigram = function(x){
temp = strsplit(x, split = "")[[1]]
lenTemp = length(temp)
bigrams = c()
# Word-initial bigram
first = paste("#", temp[1], sep = "")
bigrams[length(bigrams)+1] = first
# Word-internal bigrams
for(i in 1:(lenTemp-1)){
bigrams[length(bigrams)+1] = paste(temp[i], temp[i+1], sep = "")
}
# Word-final bigram
last = paste(temp[length(temp)], "#", sep = "")
bigrams[length(bigrams)+1] = last
return(bigrams)
}
Now let’s test it with our variable word.
wdBigram(word)
## [1] "#c" "ca" "ap" "py" "yb" "ba" "ar" "ra" "a#"
Let’s go back to the english data set and analyse our bigrams there.
library(languageR)
data(english)
d = english
head(d$Word) # These are the words used in the experiment
## [1] doe whore stress pork plug prop
## 2197 Levels: ace act add age aid aide ail aim air aisle ale angst ant ape ... zoo
words = d$Word
Now that we have assigned all the words in the column Word of the english data frame to a new variable (words), we can collect the bigrams in each word and see which bigrams are more common.
# Step 1: create an empty list
allBigrams = list()
# Step 2: create a for-loop to apply wdBigram to all words
for(word in words){
allBigrams[[length(allBigrams)+1]] = wdBigram(word)
}
# Step 3: unlist allBigrams
allBigrams = unlist(allBigrams)
head(allBigrams, 10)
## [1] "#d" "do" "oe" "e#" "#w" "wh" "ho" "or" "re" "e#"
Now, let’s take a look at the number of tokens per type:
allBigrams = as.data.frame(allBigrams) # Create a data frame
names(allBigrams) = "bigram" # Let's name the only column we have
count(group_by(allBigrams, bigram)) # These are all the bigrams we have:
## # A tibble: 361 × 2
## bigram n
## <chr> <int>
## 1 #a 48
## 2 #b 388
## 3 #c 458
## 4 #d 204
## 5 #e 34
## 6 #f 292
## 7 #g 274
## 8 #h 216
## 9 #i 14
## 10 #j 62
## # … with 351 more rows
# Because there's only one variable in allBigrams, we could simplify the count:
count(allBigrams, bigram)
## # A tibble: 361 × 2
## bigram n
## <chr> <int>
## 1 #a 48
## 2 #b 388
## 3 #c 458
## 4 #d 204
## 5 #e 34
## 6 #f 292
## 7 #g 274
## 8 #h 216
## 9 #i 14
## 10 #j 62
## # … with 351 more rows
This looks promising, but it’s still kind of messy.
# First, let's create another data frame with the output from count()
results = count(group_by(allBigrams, bigram))
# Now let's arrange it by the frequency of each bigram
results = arrange(results, desc(n))
# Finally, let's check the 10 most and least frequent bigrams
head(results, 10)
## # A tibble: 10 × 2
## bigram n
## <chr> <int>
## 1 e# 1096
## 2 #s 830
## 3 t# 654
## 4 #c 458
## 5 #b 388
## 6 k# 366
## 7 #p 344
## 8 h# 344
## 9 d# 318
## 10 l# 306
tail(results, 10)
## # A tibble: 10 × 2
## bigram n
## <chr> <int>
## 1 ws 2
## 2 wt 2
## 3 xe 2
## 4 xt 2
## 5 yi 2
## 6 yl 2
## 7 yn 2
## 8 yr 2
## 9 yt 2
## 10 za 2
# In fact, let's create a variable for the 15 most frequent bigrams:
popularBigrams = results[1:15,]
head(popularBigrams)
## # A tibble: 6 × 2
## bigram n
## <chr> <int>
## 1 e# 1096
## 2 #s 830
## 3 t# 654
## 4 #c 458
## 5 #b 388
## 6 k# 366
Finally, you could also calculate the bigram probability of each word by extending the function above (on my website you can find a function that does that).
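As a sketch of one way to do this (not the author’s exact function), you could score a word by summing the log relative frequencies of its bigrams, using the counts stored in results above. Unseen bigrams yield NA here; a real implementation would want smoothing.
bigramProb = function(x, counts = results){
  bs = wdBigram(x) # Bigrams of the word
  probs = counts$n[match(bs, counts$bigram)] / sum(counts$n) # Relative frequency of each bigram
  sum(log(probs)) # Log probability; NA if any bigram is unseen
}
bigramProb("pork")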
An entire workshop could be devoted to regular expressions. This is an excellent introduction, which will show you the most fundamental aspects of regex in R. In what follows, I will simply show you a quick example.
What if you wanted to examine only word-internal bigrams? In other words, you want to exclude all bigrams that contain #. This is an easy pattern to match using regular expressions.
wdInternal = results[!grepl("#", results$bigram), ]
head(wdInternal)
## # A tibble: 6 × 2
## bigram n
## <chr> <int>
## 1 st 304
## 2 ch 286
## 3 ea 260
## 4 in 232
## 5 re 226
## 6 ar 224
There is a family of functions in R dedicated to regular expressions. In the example above, I use grepl(), which returns TRUE or FALSE for each element of a vector, depending on whether the pattern is matched (its sibling grep() returns the indices of the matches instead). Because I don’t want rows containing the pattern, I negate the result with !, so only rows without # are returned.
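Here is a quick illustration of the main members of that family (all in base R):
x = c("e#", "st", "#c")
grepl("#", x) # TRUE FALSE TRUE: does each element match?
grep("#", x) # 1 3: indices of the matching elements
sub("#", "", x) # Replace the first match in each element
gsub("#", "_", x) # Replace all matches in each element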
But what if you want word-internal bigrams that only contain consonants? Perhaps you are interested in consonant clusters, for example. Note that we are working with orthography here.
notCons = c("a", "e", "i", "o", "u", "y", "w")
clusters = wdInternal[!grepl(paste(notCons, collapse = "|"), wdInternal$bigram), ]
head(clusters)
## # A tibble: 6 × 2
## bigram n
## <chr> <int>
## 1 st 304
## 2 ch 286
## 3 sh 194
## 4 ck 142
## 5 ll 130
## 6 th 128
First, we created a variable to hold all values we do not wish to have in our bigrams. Then, we used paste() to combine all these segments with |, which in regex means or (recall that this symbol is also used in logical conditions when subsetting). As a result, grepl() will look for a OR e OR i etc. If any of these segments is found, the output is TRUE. Because we used !, every case that evaluates to TRUE is ignored (i.e., removed). What remains is a data frame with consonant-only bigrams (also note that we’re using wdInternal as our data frame, so word-initial and word-final bigrams are not a problem anymore).
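The same filter can be written more compactly with a regex character class instead of paste(..., collapse = "|"):
clusters2 = wdInternal[!grepl("[aeiouyw]", wdInternal$bigram), ]
head(clusters2) # Same rows as clusters above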
Working with dates may be important (e.g., time series analysis). If you have longitudinal data, it might be useful to see trends as a function of time, for example. But using dates as variables may be tricky. Here, I’ll use as an example the registration form for this workshop: I simply downloaded the form spreadsheet as a csv file and opened it with R.
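The import step itself is not shown here; a minimal sketch, assuming a hypothetical file name:
ws = read.csv("registrations.csv", stringsAsFactors = FALSE) # Hypothetical file name; adjust to your download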
First, let’s look at what the csv file looks like (I have assigned it to the variable ws). These are the first six rows in the data frame. Note the Timestamp column.
head(ws[,c("Timestamp", "Institution", "Field")])
## # A tibble: 6 × 3
## Timestamp Institution Field
## <chr> <chr> <chr>
## 1 8/19/2016 1:24:41 McGill Linguistics
## 2 8/19/2016 11:02:45 McGill Communication Sciences and Disorders
## 3 8/19/2016 14:00:01 McGill Communication Sciences and Disorders
## 4 8/19/2016 21:58:30 McGill Psychology
## 5 8/20/2016 7:45:15 McGill Linguistics
## 6 8/22/2016 9:39:28 McGill Communication Sciences and Disorders
Here’s what we want to do: we want a plot that shows the number of entries in the form by date. Below, I use a long pipeline to do the following:
1. select the columns of interest (Timestamp, Institution, Field)
2. create a Date column by stripping the time of day from Timestamp
3. drop the original Timestamp column
4. convert Date to class Date
5. count the registrations per date and plot them
Step 4 is important: when a given object is of class Date, R knows how to interpret it correctly (i.e., as a date, and not, e.g., as a character or as a factor). In the code below, you will see that I’m using scale_x_date() as a ggplot layer. Within that layer, date_breaks tells R the intervals I wish to use on the x-axis, whereas date_labels tells R how to actually display the date. For example, if I’d used %A, the x-axis would contain actual day names (Sunday, Monday etc.). Read more about dates in R here.
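To see how these format codes work, here’s a quick illustrative example with a single Date object (the same codes are used by scale_x_date()):
oneDate = as.Date("2016-08-19")
format(oneDate, "%A") # "Friday" (full day name, in an English locale)
format(oneDate, "%b, %d") # "Aug, 19" (abbreviated month, day)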
library(ggplot2)
library(stringr)
library(dplyr)
ws %>% select(Timestamp, Institution, Field) %>%
mutate(Date = str_replace(Timestamp, " .*", "")) %>%
select(-Timestamp) %>%
mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
group_by(Date) %>% summarise(n = n()) %>%
ggplot(aes(x=Date, y=n)) +
geom_line(aes(group=1)) +
geom_point(size=4) +
scale_x_date(date_breaks = "3 days", date_labels = "%b, %d") +
scale_y_continuous(breaks = 1:6) +
labs(x = "Date", y = "Number of registrations", title = "R workshop") +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5))
Note that the resulting data frame has a different date format, namely yyyy-mm-dd. Here’s what ws looks like before we plot it.
ws %>% select(Timestamp, Institution, Field) %>%
mutate(Date = str_replace(Timestamp, " .*", "")) %>%
select(-Timestamp) %>%
mutate(Date = as.Date(Date, format = "%m/%d/%Y"))
## # A tibble: 33 × 3
## Institution Field Date
## <chr> <chr> <date>
## 1 McGill Linguistics 2016-08-19
## 2 McGill Communication Sciences and Disorders 2016-08-19
## 3 McGill Communication Sciences and Disorders 2016-08-19
## 4 McGill Psychology 2016-08-19
## 5 McGill Linguistics 2016-08-20
## 6 McGill Communication Sciences and Disorders 2016-08-22
## 7 McGill Communication Sciences and Disorders 2016-08-22
## 8 Concordia Education 2016-08-22
## 9 Concordia Linguistics 2016-08-23
## 10 Concordia Education 2016-08-24
## # … with 23 more rows
Several html widgets make it possible to combine R, html and Javascript. One nice example is the networkD3 package, which allows R to make use of the D3 library. The bottom line is: R can be linked to several other nice tools for data visualization, which may or may not be helpful depending on what data you want to analyse/visualize and how you plan to present your plots. If you like dynamic and interactive (reactive) plots, you’ll be limited to online platforms (e.g., your website).
Here’s a very simple example: a network of the genealogy of Romance languages. For a more elaborate example, click here.
library(networkD3)
library(extrafont)
src = c("LATIN", "LATIN",
"Vulgar Latin", "Vulgar Latin",
"Continental Romance", "Continental Romance", "Continental Romance",
"Italo-Western", "Italo-Western",
"Eastern Romance", "Eastern Romance",
"Western Romance", "Western Romance",
"Proto-Italian",
"Balkan Romance",
"Ibero-Romance", "Ibero-Romance",
"Gallo-Romance", "Gallo-Romance",
"Proto-Romanian", "Proto-Romanian",
"Occitano-Romance", "Occitano-Romance")
target = c("Classical Latin", "Vulgar Latin",
"Continental Romance", "Sardinian Language",
"Italo-Western", "African Romance", "Eastern Romance",
"Western Romance", "Proto-Italian",
"Balkan Romance", "Dalmatian",
"Ibero-Romance", "Gallo-Romance",
"Italian",
"Proto-Romanian",
"Portuguese", "Spanish",
"Occitano-Romance", "French",
"Romanian", "Aromanian",
"Catalan", "Occitan")
romance = data.frame(src, target, stringsAsFactors = FALSE)
simpleNetwork(romance, opacity = 0.8,
zoom = FALSE,
linkDistance = 45,
nodeColour = c("#A5271D"),
fontSize = 15,
charge = -30,
fontFamily = "CMU Sans Serif",
width = "100%"
)
As usual, there are many different ways to plot ordinal data. One common way is to use stacked bars. As an example, let’s first create our data. Assume we have 10 participants with 10 observations each. They each had to rate the naturalness of a given structure on a 5-point scale. Finally, let’s say these 10 speakers belong to two different groups (for some reason).
First, let’s create our data frame, called df here.
library(ggplot2)
library(dplyr)
library(RColorBrewer)
# STEP 1: create data
# gl() is used for creating factor levels
df = data.frame(
speaker = gl(n = 10, k = 10, labels = paste("speaker", seq(1:10), sep = " ")),
group = gl(n = 2, k = 50, labels = c("Group 1", "Group 2")),
response = as.ordered(sample(seq(1,5), 100, replace = TRUE))
)
# STEP 2: Plot responses
ggplot(data = df, aes(x = speaker, fill = response)) +
geom_bar(stat = "count") +
facet_grid(~group, scales = "free_x") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
labs(y = "", x = "") +
scale_fill_brewer("Response", palette = "OrRd")
The ggplot call above has some important details, e.g., removing the y-axis ticks and labels (you may want to keep them, though). The final line tells R to use a particular colour palette ("OrRd"), which goes from orange to red. To see what other palettes exist, type RColorBrewer::display.brewer.all(). Naturally, you may need to install the RColorBrewer package if it’s not installed yet.
A note on step 1 above: I generated responses using the sample() function, which randomly draws a sample (in this case from the sequence that goes from 1 to 5).
Finally, two things you may find useful.
First: If you don’t want to remove the y-axis labels and ticks, just delete the two respective lines of code from the ggplot call. You can also add a final layer to specify which breaks you want to use along that axis. In fact, you may want percentages by speaker instead of counts (here counts are easy to read because there are exactly 10 items per speaker). My suggestion is to use dplyr to prepare the data before plotting it if that’s what you want. The reason is that ggplot will calculate percentages over all facets by default, and fixing that is unnecessarily complex. Preparing your data before plotting is a better idea here. Like this:
df2 = df %>% group_by(speaker, response) %>%
summarise(n = n()) %>%
mutate(prop = n/sum(n))
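You could then plot the proportions with geom_col() (a sketch; note that df2 as computed above no longer carries the group column, so there is no faceting here):
ggplot(df2, aes(x = speaker, y = prop, fill = response)) +
  geom_col() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(y = "Proportion", x = "") +
  scale_fill_brewer("Response", palette = "OrRd")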
Second: The legend in the plot above goes from 1 to 5 (top down), whereas the plot goes the opposite way. This may or may not bother you. I personally prefer to make both legend and plot go the same direction. To do that, you can pass another argument to the scale_fill_brewer() layer. Below, I add such an argument (breaks), and tell it to use as legend breaks the inverse order of the levels in response. Simply put, rev(levels(df$response)) is equivalent to c("5", "4", "3", "2", "1").
This solution is better than releveling the factor (remember this is an ordered factor). In sum, you just want to reorder it in the plot, not in the data.
PS: I’ve also added another argument to scale_fill_brewer(): direction. If you use 1 here, you get the “higher = darker” colour pattern; if you use -1, the higher end of the scale will be lighter. 1 makes more sense to me here. To get more info on function arguments, always check the documentation by typing ? before a function name (or using help()).
ggplot(data = df, aes(x = speaker, fill = response)) +
geom_bar(stat = "count") +
facet_grid(~group, scales = "free_x") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(y = "", x = "") +
scale_fill_brewer("Response", palette = "OrRd", direction = 1,
breaks = rev(levels(df$response))) +
scale_y_continuous(breaks = seq(from = 0, to = 11, by = 2)) # y-axis breaks
Sometimes you may want to know how much memory R objects use (mainly if you’re dealing with very large files at once). For example, you might want to know how large a given data frame is so you can anticipate the size of the file you will generate.
In Linguistics, we normally do not work with big data, so our files tend to be very small compared to genetics, for instance. A csv file with more than 100Mb is relatively rare; text files with 1Gb would be quite unusual in a typical linguistics experiment (remember that RData is way better than csv in this respect as well).
To know how large an object is, there are two useful functions: object.size(), which comes with base R, and mem_used(), from the pryr package. The former tells you how large any object is: a data frame, a vector, a function, etc. The latter tells you the memory being used by all the objects in your workspace. (pryr also provides its own object_size(), with an underscore, which accounts for shared memory.)
On top of that, you can use the format() function to define the unit of the output. I usually want the output in megabytes, so:
# Let's create an object with 100,000 random numbers
testObject = rnorm(100000)
# Load package
library(pryr)
##
## Attaching package: 'pryr'
## The following objects are masked from 'package:purrr':
##
## compose, partial
object.size(testObject)
## 800048 bytes
format(object.size(testObject), units = "Mb")
## [1] "0.8 Mb"
# Let's check the size of some objects created during this tutorial:
object.size(clusters)
## 20096 bytes
object.size(d)
## 1483368 bytes
object.size(popularBigrams)
## 5816 bytes
# These are all the objects in our workspace:
objects()
## [1] "allBigrams" "badWords" "byFrication" "clusters"
## [5] "columns" "copy_df1" "d" "d2"
## [9] "dAgg" "dClean" "df" "df_NAs"
## [13] "df1" "df2" "dSum" "eng"
## [17] "english" "fricMean" "glmModel1" "glmModel2"
## [21] "group1" "group2" "i" "lModel1"
## [25] "lModel2" "long1" "long2" "long2a"
## [29] "long2b" "longData" "myAnova" "myAnova2"
## [33] "myFunction" "myNumber" "myNumber2" "myNumbers"
## [37] "new_df1" "newData" "newRTs" "notCons"
## [41] "noWar" "orderedRT" "orderedRT2" "outcome"
## [45] "person1" "person2" "person3" "popularBigrams"
## [49] "preds" "preds2" "probPreds" "propData"
## [53] "results" "romance" "sentence" "sentenceWords"
## [57] "someData" "src" "target" "testObject"
## [61] "var1" "var2" "var3" "wdBigram"
## [65] "wdInternal" "wide_again" "wide1" "wide2"
## [69] "word" "wordLetters" "words" "ws"
## [73] "x" "y" "young" "youngV"
## [77] "z"
Finally, to check how much space is being used by all these objects:
mem_used()
## 229 MB
I have included several useful links throughout the tutorial above. Some of them are repeated in this section (e.g., on ggplot2 in R), and some are particularly helpful.
Make sure to check the following packages/tutorials if you are interested in text/string manipulation. This may be very useful for extracting word/sentence/phonotactic patterns from corpora, for example. The tm package is specifically designed for text mining (hence the name).