Created: May 2022. Last updated: August 05, 2022.
This is a simple tutorial showing how you can use R to extract information from a website. More specifically, we'll be extracting linguistic data (Polish) from Wiktionary. You will need to have two packages installed: tidyverse and rvest. In addition, you need to have some familiarity with HTML, CSS, and regular expressions.
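If you don't have the two packages yet, they can be installed from CRAN in the usual way:

```r
# Install the two packages (only needed once):
install.packages(c("tidyverse", "rvest"))
```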
Understanding the structure of a page is essential in web scraping. You will always need to access the page source when you're planning your task, since you must know where on the page a particular element is. If you've never done that, simply google "access page source in X", where X is your browser. You will probably also want to google how to "copy CSS or XPath from page", because this will tell you the "address" of a particular object on a webpage.
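Once you have copied a selector from your browser's inspector, you can pass it to rvest to target that element. Here's a minimal sketch of that workflow; the URL and the selector below are made up for illustration, not taken from the Wiktionary page:

```r
library(rvest)

# Hypothetical page, just to illustrate the workflow:
page = read_html("https://example.com")

# Paste the CSS selector you copied from the browser's inspector:
page %>%
  html_nodes("div.entry > span.word")
```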
Step 1: Our task
On this page, we have a list of 5000 Polish words (along with their frequencies), which you can see in the figure on the left below. If you click on any given word, you are taken to an entry (figure on the right) where, in most cases, you have access to the IPA transcription of the word.
Our task involves creating a table with three columns: Word, IPA, and Frequency. We can find the information for Word and Frequency on a single page (figure on the left above), but to extract the IPA for each word, we will need to "visit" every link for every word, find where the transcription is, and extract it.
We start by loading our packages and importing our page of interest. The code below extracts the nodes we care about. We then use regular expressions to isolate the word and its frequency for one entry.
```r
library(rvest)
library(tidyverse)

# Main URL we will be using
mainURL = "https://en.wiktionary.org/"

# Read wordlist
pol = read_html("https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/Polish_wordlist")

# Create empty tibble
polish = tibble(Word = rep(NA, 5000),
                IPA = rep(NA, 5000),
                Freq = rep(NA, 5000))

# Select frequencies
s0 = pol %>%
  html_nodes("li")

# Select only word entries
# This contains both words and frequencies
s0 = s0[1:5000]

# Select all words (no frequencies)
s1 = pol %>%
  html_nodes("span") %>%
  html_nodes("a")

# Remove top rows (not words)
s1 = s1[4:5003]

# Visualize first word to check our code:
str_extract(s1[1], ">\\w*<") %>%
  str_remove_all(">|<")

## [1] "nie"

# Visualize first frequency:
str_extract(s0[1], "\\s[:digit:]+") %>%
  str_remove_all("\\s")

## [1] "6332218"
```
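Before moving on, it's also worth confirming that the two node sets are aligned, i.e., that both contain exactly one element per word:

```r
# Both should be 5000; if not, the slicing indices above need adjusting:
length(s0)
length(s1)
```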
Now we know how to do two thirds of our task for a single word. You should never start scaling up the task before you’re absolutely sure it works for a single element. Next, we need to get the pronunciation of a word. This is harder because it involves accessing an additional page.
The code below extracts the IPA transcription of the first word in our list. It already includes two bug fixes: the page for a given word may not exist at all (a red link), and a page that does exist may contain no IPA transcription. Both issues are addressed in the code (the entry is simply skipped, so its IPA stays NA). However, there is one problem the code doesn't completely fix.

The page for a given word is not exclusive to Polish. If a word happens to exist in multiple languages (see figure on the right above), it will still have a single page, with sections for each language. In other words, when you visit the page for any given word, it's possible to find multiple IPA transcriptions. How can we pick only the Polish one? This shouldn't be a major issue, but the way the page was built makes it very hard to target only the IPA transcriptions from Polish (given the hierarchical structure of the page).
The not-so-great solution in the code is actually very simple, and makes use of the alphabetical order in which language entries (and their transcriptions) appear on any given page: just pick the last IPA transcription you can find. This will not always work: if, for example, a word exists in both Polish and Portuguese, we will be picking the Portuguese one. This should be rare, though.
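A more targeted (though untested) alternative would exploit the page's own structure: on Wiktionary, each language section typically starts with an h2 heading whose headline span has the language name as its id. Assuming that layout holds, an XPath query applied to the word's page (ipa3 in the code below) could keep only the IPA spans whose nearest preceding heading is the Polish one. This is a sketch under that assumption, not something the tutorial's code relies on:

```r
# Sketch only: assumes the standard Wiktionary section layout
# (h2 heading > span with id equal to the language name):
ipa_pl = ipa3 %>%
  html_nodes(xpath = "//span[contains(@class, 'IPA')][preceding::h2[1]/span/@id = 'Polish']")
```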
```r
ipa0 = str_split(s1[1], " ")[[1]][2]
ipa1 = str_sub(ipa0, start = 8, end = -2L)
ipa2 = str_c(mainURL, ipa1)

# If page doesn't exist, skip to next word:
if(str_detect(ipa2, "redlink")){
  next
}

ipa3 = read_html(ipa2)
ipa4 = ipa3 %>%
  html_nodes(".IPA")

# If node is empty (i.e., no IPA), skip to next word:
if(is_empty(ipa4)){
  next
}

# Pick only IPAs with // or []:
ipa4 = ipa4[str_detect(ipa4, pattern = ">\\[|\\/<")]

if(is_empty(ipa4)){
  next
}

# Pick last IPA (see text above):
ipa4 = ipa4[length(ipa4)]
ipa5 = str_extract(ipa4, "(\\[|/)\\w*(\\.\\w*)*(\\]|/)")
ipa6 = str_replace_all(ipa5, "/|\\[|\\]", "")

ipa6

## [1] "ˈnie"
```
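To see what the first three lines are doing, it helps to inspect the intermediate objects. The exact strings depend on the page's HTML, so treat the values below as illustrative shapes rather than guaranteed outputs:

```r
# Illustrative shapes of the intermediate steps (not guaranteed outputs):
ipa0  # e.g., 'href="/wiki/nie"' — the href chunk of the word's <a> tag
ipa1  # e.g., "wiki/nie" — str_sub() strips the leading 'href="/' and the final quote
ipa2  # e.g., "https://en.wiktionary.org/wiki/nie" — the full URL we then read
```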
The code below combines everything into a for-loop, so we can fill in the table we created above with all three variables of interest. The end of the code makes class adjustments and saves the final output as an RData file. Bear in mind that this may take 10–20 minutes to run, depending on your computer.
```r
rm(list = ls())

library(rvest)
library(tidyverse)

mainURL = "https://en.wiktionary.org/"

pol = read_html("https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/Polish_wordlist")

polish = tibble(Word = rep(NA, 5000),
                IPA = rep(NA, 5000),
                Freq = rep(NA, 5000))

# Select frequencies
s0 = pol %>%
  html_nodes("li")

# Select only word entries
s0 = s0[1:5000]

# Select all words
s1 = pol %>%
  html_nodes("span") %>%
  html_nodes("a")

# Remove top rows (not words)
s1 = s1[4:5003]

# Loop to add words:
for(i in 1:nrow(polish)){

  # Pick word
  word = str_extract(s1[i], ">\\w*<") %>%
    str_remove_all(">|<")

  polish$Word[i] = word

  # Extract IPA
  ipa0 = str_split(s1[i], " ")[[1]][2]
  ipa1 = str_sub(ipa0, start = 8, end = -2L)
  ipa2 = str_c(mainURL, ipa1)

  # If page doesn't exist, skip to next word:
  if(str_detect(ipa2, "redlink")){
    next
  }

  ipa3 = read_html(ipa2)
  ipa4 = ipa3 %>%
    html_nodes(".IPA")

  # If node is empty (i.e., no IPA), skip to next word:
  if(is_empty(ipa4)){
    next
  }

  # Pick only IPAs with // or []:
  ipa4 = ipa4[str_detect(ipa4, pattern = ">\\[|\\/<")]

  if(is_empty(ipa4)){
    next
  }

  # Pick last IPA:
  ipa4 = ipa4[length(ipa4)]
  ipa5 = str_extract(ipa4, "(\\[|/)\\w*(\\.\\w*)*(\\]|/)")
  ipa6 = str_replace_all(ipa5, "/|\\[|\\]", "")

  polish$IPA[i] = ipa6

  # Extract frequencies
  polish$Freq[i] = str_extract(s0[i], "\\s[:digit:]+") %>%
    str_remove_all("\\s")
}

tail(polish)

polish = polish %>%
  mutate(across(where(is.character), as.factor))

# Save RData:
save(polish, file = "Polish.RData")
```
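Once the file is saved, it can be loaded back in a later session (assuming Polish.RData is in your working directory):

```r
# Load the table back and inspect its structure:
load("Polish.RData")
glimpse(polish)
```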
You can download the file here. Below you can see a sample of 10 rows from our table.
| Word | IPA | Freq |
|---|---|---|
| postać | ˈpɔ.stat͡ɕ | 3324 |
| telewizji | tɛ.lɛˈviz.ji | 10316 |
| żyją | ˈʐɨ.jɔw̃ | 10888 |
| czegoś | ˈt͡ʂɛ.ɡɔɕ | 51618 |
| ładny | ˈwad.nɨ | 8490 |
| wyjdziesz | ˈvɨj.d͡ʑɛʂ | 7322 |
| odsunąć | ɔtˈsu.nɔɲt͡ɕ | 3770 |
| milion | milǐoːn | 8551 |
| porno | ˈpor.no | 3618 |
| myśleliśmy | mɨɕ.lɛˈliɕ.mɨ | 4463 |
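If you want to draw a similar random sample yourself, one option (assuming dplyr ≥ 1.0, which is part of the tidyverse) is slice_sample():

```r
# Draw 10 random rows, dropping the words we skipped (rows with NA):
polish %>%
  drop_na() %>%
  slice_sample(n = 10)
```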
Copyright © 2022 Guilherme Duarte Garcia