Created: May 2022. Last updated: August 5, 2022.

This is a simple tutorial showing how you can use R to extract information from a website. More specifically, we’ll be extracting linguistic data (Polish) from Wiktionary. You will need to have two packages installed: tidyverse and rvest. In addition, you need some familiarity with HTML, CSS, and regular expressions.
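If you don’t have the two packages yet, both can be installed from CRAN:

install.packages(c("tidyverse", "rvest"))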

Understanding the structure of a page is essential in web scraping. You will always need to inspect the page source when planning your task, since you must know where on the page a particular element is. If you’ve never done that, simply google “access page source in X”, where X is your browser. You will probably also want to google how to “copy CSS selector or XPath from page”, because a selector is the “address” of a particular object on a webpage.
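Once you have copied a selector, rvest lets you use it directly: html_nodes() accepts either a CSS selector or an XPath expression. The selectors below are made-up placeholders, just to show the syntax:

library(rvest)

page = read_html("https://example.com")

# With a (hypothetical) CSS selector copied from the browser:
page %>% html_nodes("div.entry > span.word")

# The same idea with a (hypothetical) XPath expression:
page %>% html_nodes(xpath = "//div[@class='entry']/span")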


Step 1: Our task

On this page, we have a list of 5000 Polish words (along with their frequencies), which you can see in the figure on the left below. If you click on any given word, you are taken to an entry (figure on the right) where, in most cases, you have access to the IPA transcription of the word.

Our task involves creating a table with three columns: Word, IPA, and Frequency. We can find the information for Word and Frequency on a single page (figure on the left above), but to extract the IPA for each word, we will need to “visit” every link for every word, find where the transcription is, and extract it.

Step 2: Words and frequencies

We start by loading our packages and importing our page of interest. The code below extracts the nodes we care about. We then use regular expressions to isolate the word and its frequency for one entry.

library(rvest)
library(tidyverse)

# Main URL we will be using
mainURL = "https://en.wiktionary.org/"

# Read wordlist
pol = read_html("https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/Polish_wordlist")

# Create empty tibble
polish = tibble(Word = rep(NA, 5000),
                IPA = rep(NA, 5000),
                Freq = rep(NA, 5000))

# Select all list items; each one contains a word and its frequency
s0 = pol %>% 
  html_nodes("li")

# Keep only the first 5000 items, which are the word entries
s0 = s0[1:5000]

# Select all words (no frequencies)
s1 = pol %>% 
  html_nodes("span") %>% 
  html_nodes("a")

# Remove top rows (not words)
s1 = s1[4:5003]

# Visualize first word to check our code:
str_extract(s1[1], ">\\w*<") %>% 
    str_remove_all(">|<")
## [1] "nie"
# Visualize first frequency:
str_extract(s0[1], "\\s[:digit:]+") %>% 
    str_remove_all("\\s")
## [1] "6332218"

Now we know how to do two thirds of our task for a single word. You should never start scaling up the task before you’re absolutely sure it works for a single element. Next, we need to get the pronunciation of a word. This is harder because it involves accessing an additional page.

Step 3: Transcriptions

The code below extracts the IPA transcription of the first word in our list. It already includes two bug fixes. First, some words in the list have no entry page at all (their links are red links, i.e., pages that don’t exist yet). Second, a page may exist but contain no IPA transcription.

Both issues are addressed in the code: when either one occurs, the word is skipped and its IPA remains NA in our table. There is, however, one problem the code doesn’t completely fix. The page for a given word is not exclusive to Polish. If a word happens to exist in multiple languages (see figure on the right above), it will still have a single page, with sections for each language. In other words, when you visit the page for any given word, it’s possible to find multiple IPA transcriptions. How can we pick only the Polish one? This shouldn’t be a major issue, but the way the page is built makes it very hard to target only the Polish IPA transcription (given the hierarchical structure of the page).

The not-so-great solution in the code is very simple, and makes use of the alphabetical order in which language entries (and their transcriptions) appear on any given page: just pick the last IPA transcription you can find. This will not always work: if, for example, a word exists in both Polish and Portuguese, we will pick the Portuguese transcription. Such cases should be rare, though.
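If you did want to target the Polish section directly, here is one possible (untested) sketch using XPath: it keeps only the IPA spans whose nearest preceding language heading is Polish. This assumes that each language section starts with an h2 heading containing a span whose id is the language name, which is how these Wiktionary pages appear to be structured:

# Keep only IPA nodes inside the Polish section of the page
# (ipa3 is the page for a word, read with read_html() below):
polishIPA = ipa3 %>% 
  html_nodes(xpath = "//span[contains(@class, 'IPA')][preceding::h2[1][span/@id = 'Polish']]")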

# Extract the href from the word's node and build the full URL:
ipa0 = str_split(s1[1], " ")[[1]][2]
ipa1 = str_sub(ipa0, start = 8, end = -2L)
ipa2 = str_c(mainURL, ipa1)

# If page doesn't exist, skip to next word. Note that next only
# works inside a loop; this chunk is written exactly as it will
# appear inside the for-loop in Step 4:
if(str_detect(ipa2, "redlink")){
  next
}
ipa3 = read_html(ipa2)

# Select all IPA nodes on the page:
ipa4 = ipa3 %>% 
  html_nodes(".IPA")

# If node is empty (i.e., no IPA), skip to next word:
if(is_empty(ipa4)){
  next
}

# Pick only IPAs with // or []:
ipa4 = ipa4[str_detect(ipa4, pattern = ">\\[|\\/<")]

if(is_empty(ipa4)){
  next
}

# Pick last IPA (see text above):
ipa4 = ipa4[length(ipa4)]

# Extract the transcription itself and strip the slashes/brackets:
ipa5 = str_extract(ipa4, "(\\[|/)\\w*(\\.\\w*)*(\\]|/)")
ipa6 = str_replace_all(ipa5, "/|\\[|\\]", "")

ipa6
## [1] "ˈnie"

Step 4: Combining everything

The code below combines everything into a for-loop, so we can fill in the table we created above with all three variables of interest. The end of the code adjusts the column classes (words and transcriptions as factors, frequencies as numbers) and saves the final output as an RData file. Bear in mind that the loop may take 10–20 minutes to run, depending on your computer and connection.

rm(list=ls())
library(rvest)
library(tidyverse)

mainURL = "https://en.wiktionary.org/"

pol = read_html("https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/Polish_wordlist")

polish = tibble(Word = rep(NA, 5000),
                IPA = rep(NA, 5000),
                Freq = rep(NA, 5000))

# Select all list items (words and frequencies)
s0 = pol %>% 
  html_nodes("li")

# Keep only the word entries
s0 = s0[1:5000]

# Select all words
s1 = pol %>% 
  html_nodes("span") %>% 
  html_nodes("a")

# Remove top rows (not words)
s1 = s1[4:5003]

# Loop to add words:
for(i in 1:nrow(polish)){
  
  # Pick word
  word = str_extract(s1[i], ">\\w*<") %>% 
    str_remove_all(">|<")
  
  polish$Word[i] = word
  
  # Extract frequency. We do this before the IPA step, so that words
  # skipped with next below still get their frequency recorded:
  polish$Freq[i] = str_extract(s0[i], "\\s[:digit:]+") %>% 
    str_remove_all("\\s")
  
  # Extract IPA
  ipa0 = str_split(s1[i], " ")[[1]][2]
  ipa1 = str_sub(ipa0, start = 8, end = -2L)
  ipa2 = str_c(mainURL, ipa1)
  # If page doesn't exist, skip to next word:
  if(str_detect(ipa2, "redlink")){
    next
  }
  ipa3 = read_html(ipa2)
  ipa4 = ipa3 %>% 
    html_nodes(".IPA")
  # If node is empty (i.e., no IPA), skip to next word:
  if(is_empty(ipa4)){
    next
  }
  
  # Pick only IPAs with // or []:
  ipa4 = ipa4[str_detect(ipa4, pattern = ">\\[|\\/<")]
  
  if(is_empty(ipa4)){
    next
  }
  
  # Pick last IPA:
  ipa4 = ipa4[length(ipa4)]
  ipa5 = str_extract(ipa4, "(\\[|/)\\w*(\\.\\w*)*(\\]|/)")
  
  ipa6 = str_replace_all(ipa5, "/|\\[|\\]", "")
  
  polish$IPA[i] = ipa6
  
}

# Inspect the result:
polish
tail(polish)

# Class adjustments: words and transcriptions as factors,
# frequencies as numbers:
polish = polish %>% 
  mutate(Word = as.factor(Word),
         IPA = as.factor(IPA),
         Freq = as.numeric(Freq))

# Save RData:
save(polish, file = "Polish.RData")

You can download the file here. Below you can see a sample of 10 rows from our table.
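If you want to inspect the table yourself, something along these lines will load the file and draw a random sample (assuming Polish.RData is in your working directory; your 10 rows will of course differ from the ones below):

library(tidyverse)

load("Polish.RData")
polish %>% slice_sample(n = 10)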

Word         IPA               Freq
postać       ˈpɔ.stat͡ɕ         3324
telewizji    tɛ.lɛˈviz.ji      10316
żyją         ˈʐɨ.jɔw̃           10888
czegoś       ˈt͡ʂɛ.ɡɔɕ          51618
ładny        ˈwad.nɨ           8490
wyjdziesz    ˈvɨj.d͡ʑɛʂ         7322
odsunąć      ɔtˈsu.nɔɲt͡ɕ       3770
milion       milǐoːn           8551
porno        ˈpor.no           3618
myśleliśmy   mɨɕ.lɛˈliɕ.mɨ     4463


Copyright © 2022 Guilherme Duarte Garcia