R Basics

Intro: R objects

These are three very useful objects in R that were discussed in our first meeting.

A ………. is the equivalent of a spreadsheet in Excel. It almost always has two dimensions, namely, ………. and ………. This is what you will be working with most of the time.
Each ………. in a ………. is equivalent to a vector. As a result, a typical column contains only one type of data, i.e., a column cannot have both numbers and strings (unlike rows): if it does, the numbers will be coerced to character.
………. are one of the most flexible objects in R. They can have multiple dimensions as well as different types of data. As a result, you can hold both numbers and strings in a single variable.
To create a vector, you use the ………. command. To create a data frame from scratch, you use ………. If you want to turn an object into a data frame (e.g., a matrix), you use ………. To create lists, you type ……….

(a) data frame; rows; columns
(b) column; data frame
(c) lists
(d) c(); data.frame(); as.data.frame(); list()
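
As a minimal sketch of the four commands in (d), the example below builds each kind of object. The variable names (nums, mixed, df, m, mdf, lst) are made up for illustration and are not part of the exercise.

```r
# A vector holds a single type of data
nums = c(1, 2, 3)

# Mixing numbers and strings coerces everything to character
mixed = c(1, "a", 3)
class(mixed)
## [1] "character"

# A data frame built from scratch: each column is a vector
df = data.frame(id = nums, label = c("x", "y", "z"))

# Turning a matrix into a data frame
m = matrix(1:6, nrow = 3)
mdf = as.data.frame(m)
names(mdf)
## [1] "V1" "V2"

# A list can hold objects of different types (and sizes)
lst = list(n = nums, s = "hello", d = df)
class(lst$d)
## [1] "data.frame"
```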

How would you create the following data frame from scratch and assign it to a new variable? (You won’t normally do that in R, but it’s good practice).

item    sentence                   type     condition   version       RT
1       The nurse was nervous      filler   pauseN2     a             4.628606
2       I walk every day           filler   fallingInt  b             3.510744
3       She never speaks Japanese  target   risingInt   a             2.851694


myData = data.frame(
  item = c(1,2,3),
  sentence = c("The nurse was nervous", "I walk every day", "She never speaks Japanese"),
  type = c("filler", "filler", "target"),
  condition = c("pauseN2", "fallingInt", "risingInt"),
  version = c("a", "b", "a"),
  RT = c(4.628606, 3.510744, 2.851694)
)

myData
##   item                  sentence   type  condition version       RT
## 1    1     The nurse was nervous filler    pauseN2       a 4.628606
## 2    2          I walk every day filler fallingInt       b 3.510744
## 3    3 She never speaks Japanese target  risingInt       a 2.851694

In myData, we anticipate that R will treat item as a number, not as a factor. As a result, if you use summary(myData) (or mean(myData$item)), R will return the mean for item, which makes no sense. Using item as an example, check its class and change it into a factor.


class(myData$item)
## [1] "numeric"

# So R will return the mean of the column:

mean(myData$item) # !
## [1] 2

myData$item = as.factor(myData$item)

# Let's check that item is a factor now:

class(myData$item)
## [1] "factor"

mean(myData$item) # Now we won't be able to calculate the mean, of course
## Warning in mean.default(myData$item): argument is not numeric or logical:
## returning NA
## [1] NA
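
One related pitfall, sketched here with a made-up vector (scores is not part of myData): calling as.numeric() on a factor returns its internal level codes, not the original values. If you need the numbers back, convert to character first.

```r
scores = as.factor(c(10, 20, 30))

levels(scores)
## [1] "10" "20" "30"

# Internal codes, not the original values:
as.numeric(scores)
## [1] 1 2 3

# Going through character recovers the original numbers:
as.numeric(as.character(scores))
## [1] 10 20 30
```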

Packages

How do you install and load a package in R?


# In this example, I'm explicitly telling R which repository to use:
install.packages("nameOfPackage", repos = "https://cran.r-project.org")

# To load a package:

library("nameOfPackage")

# Alternatively:

require("nameOfPackage")

# Avoid using require(): library() is preferred

# PS: You don't actually need quotes when loading packages
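
The difference between library() and require() can be sketched as follows. The package name below is deliberately fake (an assumption for illustration), so neither call can succeed: require() returns FALSE with a warning, whereas library() stops with an error, which is one reason library() is the safer default in scripts. Note that character.only = TRUE is needed here because the package name is stored in a variable.

```r
fakePkg = "thisPackageDoesNotExist" # hypothetical name, for illustration only

# require() fails softly: it returns FALSE (plus a warning, suppressed here)
ok = suppressWarnings(require(fakePkg, character.only = TRUE, quietly = TRUE))
ok
## [1] FALSE

# library() fails loudly: it throws an error, caught here with tryCatch()
result = tryCatch(
  library(fakePkg, character.only = TRUE),
  error = function(e) "library() stopped with an error"
)
result
## [1] "library() stopped with an error"

# Install a package only if it is missing
# (stats ships with R, so nothing is actually installed here)
if (!requireNamespace("stats", quietly = TRUE)) {
  install.packages("stats", repos = "https://cran.r-project.org")
}
```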

Loading your own (hypothetical) data

You have just opened R. How do you check which objects are loaded in your workspace?

ls()

Now you want to load your data file, myFile.csv. Assume it’s located in a particular folder on your laptop: /Users/yourName/Documents/files/myFile.csv. Which command(s) could you use to load the file and assign it to the variable myData?

# Option 1:

# setwd() takes the folder that contains the file, not the file itself
setwd("/Users/yourName/Documents/files")
myData = read.csv("myFile.csv")

# Option 2:

myData = read.csv("/Users/yourName/Documents/files/myFile.csv")
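
Since the paths above are hypothetical, here is a self-contained sketch that you can run anywhere. It writes a tiny CSV into R's temporary folder (tempdir()) and reads it back in both ways, building the full path with file.path(). The file name demoFile.csv and its contents are made up for illustration.

```r
# Create a small CSV file in a temporary folder
demoDir = tempdir()
demoPath = file.path(demoDir, "demoFile.csv")
writeLines(c("item,RT", "1,4.6", "2,3.5"), demoPath)

# Option 1: set the working directory (a folder), then use the file name
oldDir = setwd(demoDir)   # setwd() returns the previous working directory
demoData1 = read.csv("demoFile.csv")
setwd(oldDir)             # go back to where we were

# Option 2: pass the full path to read.csv()
demoData2 = read.csv(demoPath)

identical(demoData1, demoData2)
## [1] TRUE
```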

Working with real data

We will use the danish data set, which comes with the languageR package. Load the data and assign danish to a new (shorter) variable, dan.

library(languageR)

data(danish)

dan = danish

This question has three parts. Normally, the first thing you want to do when you load your data file is to have a general sense of its structure and dimensions (so you know the file is correct, for example). How do you: (a) visualize the first 10 rows in your data? (b) check the class of each variable? (c) print basic stats for all variables?

# (a)
head(dan, n = 10)

# (b)
str(dan)

# (c)
summary(dan)
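
If you only want the dimensions or the classes, without the full str() output, the functions below are handy. This sketch uses iris, a data set that ships with R, as a stand-in so that it runs without languageR; the same calls should work on dan.

```r
# Number of rows and columns
dim(iris)
## [1] 150   5

# Class of each column, as a named vector
sapply(iris, class)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
##    "numeric"    "numeric"    "numeric"    "numeric"     "factor"

# First 10 rows only
first10 = head(iris, n = 10)
nrow(first10)
## [1] 10
```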

A couple of things: (a) How do you print the number of columns dan has? (b) How do you print the names of all the columns? Finally, (c) create a subset that only contains the following columns: Subject, Word, LogRT, Sex, LogWordFreq, LogUP. Assign this subset to new and visualize the first rows of new.

# (a)
ncol(dan)
## [1] 16

# (b)
names(dan)
##  [1] "Subject"        "Word"           "Affix"          "LogRT"         
##  [5] "PC1"            "PC2"            "PrevError"      "Rank"          
##  [9] "Sex"            "ResidSemRating" "ResidFamSize"   "LogWordFreq"   
## [13] "LogAffixFreq"   "LogCUP"         "LogUP"          "LogCUPtoEnd"

# (c)

newCols = c("Subject", "Word", "LogRT", "Sex", "LogWordFreq", "LogUP")

new = dan[, newCols] 

# Alternatively: 
# new = subset(dan, select = c("Subject", "Word", "LogRT", "Sex", "LogWordFreq", "LogUP"))

head(new)
##   Subject       Word    LogRT Sex LogWordFreq   LogUP
## 1    2s14 appetitlig 6.454239   M    2.944439 5.32301
## 2    2s17 appetitlig 6.842854   M    2.944439 5.32301
## 3    2s15 appetitlig 6.839958   M    2.944439 5.32301
## 4    2s04 appetitlig 6.834507   M    2.944439 5.32301
## 5    2s06 appetitlig 6.795191   F    2.944439 5.32301
## 6    2s11 appetitlig 7.062680   M    2.944439 5.32301
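
One detail worth knowing when you select columns with square brackets: if you ask for a single column, R returns a vector rather than a data frame unless you add drop = FALSE. A minimal sketch with a made-up data frame (toy):

```r
toy = data.frame(
  Subject = c("s1", "s2"),
  LogRT = c(6.4, 6.8),
  Sex = c("M", "F")
)

# Two or more columns: the result is still a data frame
class(toy[, c("Subject", "LogRT")])
## [1] "data.frame"

# A single column: the result is simplified to a vector
class(toy[, "LogRT"])
## [1] "numeric"

# drop = FALSE keeps the data frame structure
class(toy[, "LogRT", drop = FALSE])
## [1] "data.frame"
```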

Now that we have a simpler data frame, export new as a csv file named output.

write.csv(new, file = "output.csv", row.names = FALSE)
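
To see why row.names = FALSE is used above, compare what comes back when you read a file written with and without it. This sketch writes to temporary files (tempfile()) and uses a made-up data frame, toy, so it does not touch output.csv.

```r
toy = data.frame(Word = c("a", "b"), LogRT = c(6.4, 6.8))

withNames = tempfile(fileext = ".csv")
withoutNames = tempfile(fileext = ".csv")

write.csv(toy, file = withNames)                        # default: row names are written
write.csv(toy, file = withoutNames, row.names = FALSE)  # row names are left out

# The row names come back as an extra column, which read.csv() names X
names(read.csv(withNames))
## [1] "X"     "Word"  "LogRT"

names(read.csv(withoutNames))
## [1] "Word"  "LogRT"
```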

Note that we have a column for word frequency (LogWordFreq), which has been log-transformed. To backtransform it, we can take the exponential of LogWordFreq using the exp() function. Create a new column called WordFreq that backtransforms LogWordFreq.

new$WordFreq = exp(new$LogWordFreq)

head(new)
##   Subject       Word    LogRT Sex LogWordFreq   LogUP WordFreq
## 1    2s14 appetitlig 6.454239   M    2.944439 5.32301       19
## 2    2s17 appetitlig 6.842854   M    2.944439 5.32301       19
## 3    2s15 appetitlig 6.839958   M    2.944439 5.32301       19
## 4    2s04 appetitlig 6.834507   M    2.944439 5.32301       19
## 5    2s06 appetitlig 6.795191   F    2.944439 5.32301       19
## 6    2s11 appetitlig 7.062680   M    2.944439 5.32301       19
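
To convince yourself that exp() undoes log() (which computes the natural logarithm by default), here is a quick sketch with made-up frequencies (freq is not taken from the danish data):

```r
freq = c(19, 150, 2)

logFreq = log(freq)      # natural log by default
backFreq = exp(logFreq)  # back to the original scale

# Compare with all.equal() rather than ==, since floating-point rounding can creep in
all.equal(backFreq, freq)
## [1] TRUE
```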

One very useful function in R is ifelse(), which has three arguments. The first argument is the condition; the second, the result in case the condition is met; the third refers to what needs to be done if the condition is not met (i.e., the else bit).

For example:

x = 10
ifelse(x > 5, "x is greater than 5", "x is not greater than 5")
## [1] "x is greater than 5"

Because x == 10, the first argument evaluates to TRUE. As a result, the second argument is returned (the third argument is not evaluated in this case).
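
ifelse() is also vectorized: if the condition is a vector, you get one result per element, which is what makes it useful for creating columns. A small sketch with a made-up vector y:

```r
y = c(2, 7, 10)
labels = ifelse(y > 5, "greater", "not greater")
labels
## [1] "not greater" "greater"     "greater"
```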

Now, create a new column in new called isFreq. This column will have two levels: yes if the log-transformed frequency of a word is greater than 5, and no otherwise. Note that R will likely think your new column is a character, not a factor. So you also need to transform it (you can actually do it all at once by embedding functions).

new$isFreq = as.factor(
  ifelse(new$LogWordFreq > 5, "yes", "no")
  )

head(new)
##   Subject       Word    LogRT Sex LogWordFreq   LogUP WordFreq isFreq
## 1    2s14 appetitlig 6.454239   M    2.944439 5.32301       19     no
## 2    2s17 appetitlig 6.842854   M    2.944439 5.32301       19     no
## 3    2s15 appetitlig 6.839958   M    2.944439 5.32301       19     no
## 4    2s04 appetitlig 6.834507   M    2.944439 5.32301       19     no
## 5    2s06 appetitlig 6.795191   F    2.944439 5.32301       19     no
## 6    2s11 appetitlig 7.062680   M    2.944439 5.32301       19     no

# Now let's see how many `yes` and `no` we have

summary(new$isFreq)
##   no  yes 
## 1781 1545
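
table() gives the same kind of counts and, combined with prop.table(), the proportions. A sketch with a made-up vector of log frequencies (logFreq), not the actual danish data:

```r
logFreq = c(2.9, 5.3, 6.1, 4.0)
isFreq = as.factor(ifelse(logFreq > 5, "yes", "no"))

counts = table(isFreq)
counts
## isFreq
##  no yes
##   2   2

prop.table(counts)
## isFreq
##  no yes
## 0.5 0.5
```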

Intro to data analysis using R: Practice [1]

Guilherme D. Garcia

guilhermegarcia.github.io
