class: title-slide, inverse, center, middle

# Fonology

## Phonological Analysis in R

### Guilherme D. Garcia <a href = "https://gdgarcia.ca" style="color: #FEC20B">
</a> <img src="ULaval2.png" alt="Université Laval" style="width:7%">

#### Université Laval • CRBLM • CRIHN

---

## What is it? 🤔

<img src="fonology.png" alt="Fonology" style="width:15%; float: right">

- An R package to help phonologists automate certain tasks
- Currently under development; frequent updates
- **This presentation**: demo of main functions

--

## Goal

> Automate data prep for phonological analysis

--

## Content

- **Research**: speed and precision (scalability)
- **Teaching**: accessibility and interactivity

---

## Installation etc. 🧐

- Visit [gdgarcia.ca/fonology](https://gdgarcia.ca/fonology) for detailed info
- To install the package:

```r
library(devtools) # install.packages("devtools")
install_github("guilhermegarcia/fonology")
```

--

## Feedback, bugs, questions 🪲

- Go here: [github.com/guilhermegarcia/fonology/issues](https://github.com/guilhermegarcia/fonology/issues)

--

## Assumption of this presentation

- You know some basic R and are familiar with `tidyverse`

---

## Road map 🗺️

### Demo of main functions:

1. <h-l>Phonemic transcription</h-l>

--

2. Extraction of stress/syllables, syllabic constituents

--

3. Sonority

--

4. Vowel trapezoids

--

5. Natural classes

--

6. Word generator (pt) + phonotactic probability

--

7. Vowel formants + `ggplot2` (simple *wrapper*)

--

8. 
From IPA to TIPA

--

***

<br>

> *Little can be done without grapheme-phoneme conversion*

<br>

- <h-l>Written data</h-l>: easy to find and collect, difficult to analyze
- Phonemic transcription: **essential** starting point

---

class: inverse, center, middle

# Main functions

---

## Example 1: broad transcription

- `ipa_pt(...)`: phonemic transcription for Portuguese

```r
library(Fonology)

ipa_pt("concentração")
#> [1] "kon.sen.tɾa.ˈsãw̃"
ipa_pt("tipos")
#> [1] "ˈti.pos"
ipa_pt("quiséssemos")
#> [1] "ki.ˈzɛ.se.mos"
ipa_pt("parangaricutirrimirruaro")
#> [1] "pa.ɾan.ga.ɾi.ku.ti.xi.mi.xu.ˈa.ɾo"
```

--

- **Not vectorized** (i.e., serial): ideal for *one* input
- <h-l>Advantage</h-l>: probabilistic stress assignment, useful for nonce words

---

## Example 2: narrow transcription

- `ipa_pt(..., narrow = T)`

```r
ipa_pt("concentração", narrow = T)
#> [1] "kõn.ˌsẽj̃ɲ.tɾa.ˈsãw̃"
ipa_pt("tipos", narrow = T)
#> [1] "ˈt͡ʃi.pʊs"
ipa_pt("quiséssemos", narrow = T)
#> [1] "ki.ˈzɛ.se.mʊs"
ipa_pt("parangaricutirrimirruaro", narrow = T)
#> [1] "pa.ˌɾãŋ.ga.ˌɾi.ku.ˌt͡ʃi.xi.ˌmi.xu.ˈa.ɾʊ"
```

- **Not vectorized** (i.e., serial): ideal for *one* input
- <h-l>Advantage</h-l>: probabilistic stress assignment, useful for nonce words

---

## Example 3: transcription *en masse*

- <h-l>Crucial</h-l>: being able to transcribe a lot of words *quickly*
- `ipa(...)`: vectorized (<h-l>Portuguese and Spanish</h-l>)

```r
ipa(word = c("Example", "com", "múltiplas", "palavras"))
#> [1] "e.ˈzam.ple" "ˈkom" "ˈmul.ti.plas" "pa.ˈla.vɾas"
```

--

- Narrow transcription also available (for Portuguese only):

```r
ipa(word = c("Encontramos", "transcrição", "fonética", "fina", "também"), narrow = T)
#> [1] "ˌẽj̃ɲ.kõn.ˈtɾã.mʊs" "ˌtɾãns.kɾi.ˈsãw̃" "fo.ˈnɛ.t͡ʃi.kɐ"
#> [4] "ˈfĩ.nɐ" "tãm.ˈbẽj̃ɲ"
```

- **Vectorized** function (i.e., parallel): ideal for a lot of data
- <h-l>Advantage</h-l>: speed (*but* stress is assigned **categorically**)

---

## Example 4: short text 💬

- `ipa()` requires tokenized inputs
- What if our 
input is a text...?
- <mark>`cleanText()`</mark>: data cleaning and tokenization

```r
library(tidyverse)
# Sample sentence in Portuguese with some weird stuff in it:
text = "Este é um teXto 123# bastante cUrto que Não está tokenizado"

text |>
* cleanText() |>
  ipa()
#> [1] "ˈes.te" "ˈɛ" "ˈum" "ˈtes.to"
#> [5] "bas.ˈtan.te" "ˈkuɾ.to" "ˈke" "ˈnãw̃"
#> [9] "es.ˈta" "to.ke.ni.ˈza.do"
```

---

## Example 5: from short text to tibble 💬

- We often work with *data frames* or *tibbles*

```r
text = "Este é um teXto 123# bastante cUrto que Não está tokenizado"

d = tibble(word = text |> cleanText()) |> # Words
  mutate(ipa = word |> ipa()) # IPA
```

--

<table class="table" style="margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> word </th> <th style="text-align:left;"> ipa </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> este </td> <td style="text-align:left;"> ˈes.te </td> </tr>
<tr> <td style="text-align:left;"> é </td> <td style="text-align:left;"> ˈɛ </td> </tr>
<tr> <td style="text-align:left;"> um </td> <td style="text-align:left;"> ˈum </td> </tr>
<tr> <td style="text-align:left;"> texto </td> <td style="text-align:left;"> ˈtes.to </td> </tr>
<tr> <td style="text-align:left;"> bastante </td> <td style="text-align:left;"> bas.ˈtan.te </td> </tr>
<tr> <td style="text-align:left;"> curto </td> <td style="text-align:left;"> ˈkuɾ.to </td> </tr>
</tbody>
</table>

---

## Example 6: from long text to tibble 📚

### Task

1. Import *Os Lusíadas*, clean and tokenize text
2. Transcribe, syllabify, stress lexical words
3. Extract stress and final syllable
4. 
Extract final syllable constituents

--

- `getStress()`: extracts stress from transcription
- `getWeight()`: extracts weight profile (e.g., `LLH`)
- `getSyl()`: extracts a given syllable
- `syllable()`: extracts syllabic constituents
- `stopwords_pt` and `stopwords_sp`: stopwords (adapted from `tm` package)

---

## Example 6: from long text to tibble 📚

.panelset[
.panel[.panel-name[Code]

```r
lus1 = read_lines("lusiadas.txt")

lus2 = lus1 |>
  cleanText() |> # data cleaning + tokenization
  as_tibble() |>
  rename(word = value) |>
  filter(!word %in% stopwords_pt) |> # stopword removal
  mutate(ipa = ipa(word), # column for transcription
         stress = getStress(ipa), # column for stress
         weight = getWeight(ipa), # column for weight
         finSyl = getSyl(word = ipa, pos = 1), # column for final syllable
         onsetFin = syllable(finSyl, const = "onset"),
         nucFin = syllable(finSyl, const = "nucleus"),
         codaFin = syllable(finSyl, const = "coda"),
         rhFin = syllable(finSyl, const = "rhyme"))
```

]

.panel[.panel-name[Result]

- Total number of lexical words: 32618 (⏳ **< 2s**)
- *Tidy data* format ready for analysis

<br>

<table class="table" style="margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> word </th> <th style="text-align:left;"> ipa </th> <th style="text-align:left;"> stress </th> <th style="text-align:left;"> weight </th> <th style="text-align:left;"> finSyl </th> <th style="text-align:left;"> onsetFin </th> <th style="text-align:left;"> nucFin </th> <th style="text-align:left;"> codaFin </th> <th style="text-align:left;"> rhFin </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> armas </td> <td style="text-align:left;"> ˈaɾ.mas </td> <td style="text-align:left;"> penult </td> <td style="text-align:left;"> HH </td> <td style="text-align:left;"> mas </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> a </td> <td style="text-align:left;"> s </td> <td style="text-align:left;"> as </td> </tr>
<tr> <td style="text-align:left;"> barões 
</td> <td style="text-align:left;"> ba.ˈɾõj̃s </td> <td style="text-align:left;"> final </td> <td style="text-align:left;"> LH </td> <td style="text-align:left;"> ɾõj̃s </td> <td style="text-align:left;"> ɾ </td> <td style="text-align:left;"> õj̃ </td> <td style="text-align:left;"> s </td> <td style="text-align:left;"> õj̃s </td> </tr>
<tr> <td style="text-align:left;"> assinalados </td> <td style="text-align:left;"> a.si.na.ˈla.dos </td> <td style="text-align:left;"> penult </td> <td style="text-align:left;"> LLH </td> <td style="text-align:left;"> dos </td> <td style="text-align:left;"> d </td> <td style="text-align:left;"> o </td> <td style="text-align:left;"> s </td> <td style="text-align:left;"> os </td> </tr>
<tr> <td style="text-align:left;"> ocidental </td> <td style="text-align:left;"> o.si.den.ˈtal </td> <td style="text-align:left;"> final </td> <td style="text-align:left;"> LHH </td> <td style="text-align:left;"> tal </td> <td style="text-align:left;"> t </td> <td style="text-align:left;"> a </td> <td style="text-align:left;"> l </td> <td style="text-align:left;"> al </td> </tr>
<tr> <td style="text-align:left;"> praia </td> <td style="text-align:left;"> ˈpɾa.ja </td> <td style="text-align:left;"> penult </td> <td style="text-align:left;"> LL </td> <td style="text-align:left;"> ja </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> ja </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> ja </td> </tr>
</tbody>
</table>

]

.panel[.panel-name[Glides?]
- Are glides onsets, codas, or are they in the nucleus...❓
- The function `syllable()` allows us to adjust our assumptions:

<br>

<div align = "center">
<img src="syllable.png" alt="syllable" style="width:80%;">
</div>

]
]

---

## Example 7: sonority 🔉

- `demi(word = ..., d = ...)`: extraction of demisyllables (`d = 1` or `d = 2`)

```r
syllables = c("kom", "sil", "fran", "klas")

syllables |>
  demi(d = 1) # get first demisyllable
#> [1] "ko" "si" "fra" "kla"
```

--

- We can also calculate the average sonority dispersion of a vector with `meanSonDisp()`:

```r
syllables |>
  demi(d = 1) |>
  meanSonDisp()
#> [1] 2.67
```

- **Note**: The function assumes 17 levels of sonority (see Parker 2011)<sup>1</sup>

.footnote[[1] Parker, S. (2011). Sonority. In M. van Oostendorp, C. J. Ewen, E. Hume, & K. Rice (Eds.), *The Blackwell companion to phonology* (pp. 1160–1184). Wiley Online Library. https://doi.org/10.1002/9781444335262.wbctp0049]

---

## Example 8: sonority 🔊

- When teaching phonology, it may be useful to visualize sonority levels:

.pull-left[

```r
"combradol" |>
  ipa() |>
* plotSon(syl = F)
```

<img src="index_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

]

--

.pull-right[

```r
"sobremesa" |>
  ipa(lg = "Spanish") |>
* plotSon(syl = T)
```

<img src="index_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

]

---

## Example 9: vowel trapezoids

- `plotVowels()` generates trapezoids for some languages
- It also exports the <mark>`tex`</mark> file for those who use `\(\LaTeX\)`

.pull-left[

```r
plotVowels(lg = "Spanish",
*          tex = F)
```

<img src="index_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

]

--

.pull-right[

```r
plotVowels(lg = "Italian",
*          tex = F)
```

<img src="index_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

]

---

## Example 10: natural classes and distinctive features

- From phonemes to features `\(\rightarrow\)` `getFeat()`: 
```r
getFeat(ph = c("i", "u"), lg = "English")
#> [1] "+hi" "+tense"
getFeat(ph = c("p", "b"), lg = "French")
#> [1] "-son" "-cont" "+lab"
getFeat(ph = c("i", "y", "u"), lg = "French")
#> [1] "+syl" "+hi"
```

--

- From features to phonemes `\(\rightarrow\)` `getPhon()`:

```r
getPhon(ft = c("+syl", "+hi"), lg = "French")
#> [1] "u" "i" "y"
getPhon(ft = c("-DR", "-cont", "-son"), lg = "English")
#> [1] "t" "d" "b" "k" "g" "p"
getPhon(ft = c("-son", "+vce"), lg = "Spanish")
#> [1] "z" "d" "b" "ʝ" "g" "v"
```

---

## Example 11: word generator and phonotactic probability 🎲

- `wug_pt()` creates possible words in Portuguese

```r
set.seed(1)
wug_pt(profile = "LHL")
#> [1] "dɾa.ˈbuɾ.me"
```

--

- Let's create a tibble with 8 new words + their phonotactic probability with <mark>`biGram_pt()`</mark>:

```r
set.seed(1)
gen = tibble(word = wug_pt("LHL", n = 8)) |>
  mutate(bigram = word |>
*          biGram_pt()
  )
```

---

## Example 11: word generator and phonotactic probability 🎲

.pull-left[

- Lower bigram values `\(\rightarrow\)` less probable

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> word </th> <th style="text-align:right;"> bigram </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> la.ˈxus.te </td> <td style="text-align:right;"> -45.74615 </td> </tr>
<tr> <td style="text-align:left;"> xo.ˈman.xo </td> <td style="text-align:right;"> -46.32619 </td> </tr>
<tr> <td style="text-align:left;"> be.ˈʒan.tɾe </td> <td style="text-align:right;"> -49.19741 </td> </tr>
<tr> <td style="text-align:left;"> dɾa.ˈbuɾ.me </td> <td style="text-align:right;"> -49.23458 </td> </tr>
<tr> <td style="text-align:left;"> ze.ˈfɾan.ka </td> <td style="text-align:right;"> -50.74279 </td> </tr>
<tr> <td style="text-align:left;"> ʒa.ˈgɾan.fe </td> <td style="text-align:right;"> -51.86230 </td> </tr>
<tr> <td style="text-align:left;"> ʃe.ˈfoɾ.bɾe </td> <td style="text-align:right;"> -55.46661 </td> </tr>
<tr> <td 
style="text-align:left;"> me.ˈxes.vɾo </td> <td style="text-align:right;"> -68.84952 </td> </tr>
</tbody>
</table>

]

--

.pull-right[

- Bigrams are calculated based on [PSL](https://gdgarcia.ca/psl)
- Access the simplified version of the lexicon with `pt_lex`
- Or the whole lexicon with `psl`

```r
set.seed(1)
pt_lex |> sample_n(5)
#> # A tibble: 5 × 2
#>   word             pro
#>   <fct>            <chr>
#> 1 brocador         bɾo.ka.ˈdoɾ
#> 2 flagelariáceo    fla.ʒe.la.ɾi.ˈa.sjo
#> 3 ultratumular     ul.tɾa.tu.mu.ˈlaɾ
#> 4 desencrencamento de.zen.kɾen.ka.ˈmen.to
#> 5 hulheira         u.ˈʎej.ɾa
```

]

---

## Example 12: Listing bigrams 🎲

.pull-left[

```r
lus_bigramas = lus1 |>
  cleanText() |>
  ipa() |>
* nGramTbl(n = 2)
```

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> nGrams </th> <th style="text-align:left;"> n1 </th> <th style="text-align:left;"> n2 </th> <th style="text-align:right;"> freq </th> <th style="text-align:right;"> prop </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> e# </td> <td style="text-align:left;"> e </td> <td style="text-align:left;"> # </td> <td style="text-align:right;"> 12164 </td> <td style="text-align:right;"> 0.0348513 </td> </tr>
<tr> <td style="text-align:left;"> a# </td> <td style="text-align:left;"> a </td> <td style="text-align:left;"> # </td> <td style="text-align:right;"> 11556 </td> <td style="text-align:right;"> 0.0331093 </td> </tr>
<tr> <td style="text-align:left;"> o# </td> <td style="text-align:left;"> o </td> <td style="text-align:left;"> # </td> <td style="text-align:right;"> 10818 </td> <td style="text-align:right;"> 0.0309948 </td> </tr>
<tr> <td style="text-align:left;"> s# </td> <td style="text-align:left;"> s </td> <td style="text-align:left;"> # </td> <td style="text-align:right;"> 10243 </td> <td style="text-align:right;"> 0.0293474 </td> </tr>
<tr> <td style="text-align:left;"> #k </td> <td style="text-align:left;"> # </td> <td style="text-align:left;"> k </td> 
<td style="text-align:right;"> 6849 </td> <td style="text-align:right;"> 0.0196232 </td> </tr>
<tr> <td style="text-align:left;"> #a </td> <td style="text-align:left;"> # </td> <td style="text-align:left;"> a </td> <td style="text-align:right;"> 6126 </td> <td style="text-align:right;"> 0.0175517 </td> </tr>
<tr> <td style="text-align:left;"> #d </td> <td style="text-align:left;"> # </td> <td style="text-align:left;"> d </td> <td style="text-align:right;"> 5361 </td> <td style="text-align:right;"> 0.0153599 </td> </tr>
<tr> <td style="text-align:left;"> #e </td> <td style="text-align:left;"> # </td> <td style="text-align:left;"> e </td> <td style="text-align:right;"> 5140 </td> <td style="text-align:right;"> 0.0147267 </td> </tr>
</tbody>
</table>

]

--

.pull-right[

- <mark>`nGramTbl()`</mark> `\(\rightarrow\)` all bigrams
- Now it's easy to plot patterns

### Visualizing bigrams

- Together with `nGramTbl()`:
  - <mark>`plotnGrams()`</mark> `\(\rightarrow\)` figures with `ggplot2`
- Two options:
  1. `type = "heat"`
  2. `type = "lollipop"`
- In both, we define the number of bigrams `n`

]

---

## Example 12: Visualizing bigrams 🎲

.pull-left[

```r
lus_bigramas |>
* plotnGrams(type = "lollipop", n = 10)
```

<img src="index_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />

]

--

.pull-right[

```r
lus_bigramas |>
* plotnGrams(type = "heat", n = 50)
```

<img src="index_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />

]

---

## Example 12: Visualizing bigrams 🎲

- Network with top 50 bigrams from *Os Lusíadas* with `networkD3` (excluding `#`)

<br>
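
To make the contents of a bigram table more concrete, here is a minimal base-R sketch of bigram counting. It is not how `nGramTbl()` is implemented: the helper `count_bigrams()` and the toy words are hypothetical, and the choice to strip syllable boundaries and stress marks before counting is an assumption. As in the Example 12 table, `#` marks word edges.

```r
# Hypothetical helper (not part of Fonology): counts segment bigrams,
# padding each word with "#" to mark word edges.
count_bigrams = function(words) {
  # Assumption: syllable boundaries and stress marks are not segments
  clean = gsub("[.ˈˌ]", "", words)
  padded = paste0("#", clean, "#")
  # Pair each character with the character that follows it
  bigrams = unlist(lapply(padded, function(w) {
    ch = strsplit(w, "")[[1]]
    paste0(ch[-length(ch)], ch[-1])
  }))
  # Frequency table, most frequent bigram first
  tab = sort(table(bigrams), decreasing = TRUE)
  data.frame(nGrams = names(tab),
             freq = as.integer(tab),
             prop = as.numeric(tab) / sum(tab))
}

# Toy input (made-up transcriptions, for illustration only)
toy = c("ˈka.za", "ˈka.ma", "ˈma.la")
count_bigrams(toy)
```

The result has one row per bigram type, with `prop` summing to 1, so it can be filtered or plotted like any other data frame.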
---

## Example 13: vowel plot 🗣️

- A simple *wrapper* for F1/F2 in `ggplot2`. Example with `vowels` (hypothetical values):

```r
ggplot(data = vowels, aes(x = F2, y = F1, color = vowel, label = vowel)) +
  geom_text() +
  theme_classic() +
  theme(legend.position = "none")
```

<img src="index_files/figure-html/unnamed-chunk-28-1.png" height="300px" style="display: block; margin: auto;" />

---

## Example 13: vowel plot 🗣️

- A simple *wrapper* for F1/F2 in `ggplot2`. Example with `vowels` (hypothetical values):

```r
ggplot(data = vowels, aes(x = F2, y = F1, color = vowel, label = vowel)) +
  geom_text() +
  theme_classic() +
  theme(legend.position = "none") +
* formants()
```

<img src="index_files/figure-html/unnamed-chunk-29-1.png" height="300px" style="display: block; margin: auto;" />

---

## Example 14: from IPA to TIPA ✏️

- If you use `\(\LaTeX\)`, `tipa` is essential
- The function `ipa2tipa()` translates an IPA sequence into `tipa`:

--

```r
"Aqui estão algumas palavras que desejo transcrever" |>
  cleanText() |>
  ipa(narrow = T) |>
  ipa2tipa(pre = "[ ", post = " ]")
#> Done! Here's your tex code using TIPA:
#> \textipa{ [ a."ki es."t\~{a}\~{w} aw."g\~{u}.m5s pa."la.vR5s "ke de."ze.ZU {""}tR\~{a}ns.kRe."veR ] }
```

--

<br>

<div align="center">
<img src="tipa.png" alt="Output tipa" style="width:100%">
</div>

---

# Questions? 😶🌫️

```r
ipa("obrigado") |>
  plotSon(syl = T)
```

<img src="index_files/figure-html/unnamed-chunk-31-1.png" height="300px" style="display: block; margin: auto;" />

- This project has benefitted from the ENVOL program (Université Laval)