
class: title-slide, inverse, center, middle

# Fonology
## Phonological Analysis in R
### Guilherme D. Garcia

#### Université Laval • CRBLM • CRIHN

---

## What is it? 🤔

<img src="fonology.png" alt="Fonology" style="width:15%; float: right">
- An R package to help phonologists automate certain tasks
- Currently under development; frequent updates

- **This presentation**: demo of main functions

## Goal

> Automate data prep for phonological analysis

## Context

- **Research**: speed and precision (scalability)
- **Teaching**: accessibility and interactivity

---

## Installation etc. 🧐

- Visit [gdgarcia.ca/fonology](https://gdgarcia.ca/fonology) for detailed info
- To install the package:

```r
# install.packages("devtools")  # run once if devtools is not installed
library(devtools)
install_github("guilhermegarcia/fonology")
```

## Feedback, bugs, questions 🪲

- Go here: [github.com/guilhermegarcia/fonology/issues](https://github.com/guilhermegarcia/fonology/issues)

## Assumption of this presentation

- You know some basic R and are familiar with `tidyverse`

---

## Road map 🗺️

### Demo of main functions:

1. <h-l>Phonemic transcription</h-l>
--

2. Extraction of stress/syllables, syllabic constituents
--

3. Sonority
--

4. Vowel trapezoids
--

5. Natural classes
--

6. Word generator (pt) + phonotactic probability
--

7. Vowel formants + `ggplot2` (simple *wrapper*)
--

8. From IPA to TIPA

***

<br>

> *Little can be done without grapheme-phoneme conversion*

<br>
- <h-l>Written data</h-l>: easy to find and collect, difficult to analyze
- Phonemic transcription: **essential** starting point

---

class: inverse, center, middle

# Main functions

---

## Example 1: broad transcription

- `ipa_pt(...)`: phonemic transcription for Portuguese

```r
library(Fonology)

ipa_pt("concentração")
#> [1] "kon.sen.tɾa.ˈsãw̃"
ipa_pt("tipos")
#> [1] "ˈti.pos"
ipa_pt("quiséssemos")
#> [1] "ki.ˈzɛ.se.mos"
ipa_pt("parangaricutirrimirruaro")
#> [1] "pa.ɾan.ga.ɾi.ku.ti.xi.mi.xu.ˈa.ɾo"
```

- **Not vectorized** (i.e., serial): ideal for *one* input
- <h-l>Advantage</h-l>: probabilistic stress assignment, useful for nonce words
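
A scalar-only function can still be mapped over several words, one call per word. A minimal sketch, assuming a hypothetical stand-in `transcribe_one()` (not part of the package) in place of `ipa_pt()`:

```r
# Hypothetical stand-in for a function that accepts exactly one word
transcribe_one <- function(word) {
  stopifnot(length(word) == 1)
  paste0("/", tolower(word), "/")
}

words <- c("Tipos", "casa", "falar")

# vapply() calls the function once per element and checks the return type
out <- vapply(words, transcribe_one, character(1), USE.NAMES = FALSE)
out
#> [1] "/tipos/" "/casa/"  "/falar/"
```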

---

## Example 2: narrow transcription

- `ipa_pt(..., narrow = T)`

```r
ipa_pt("concentração", narrow = T)
#> [1] "kõn.ˌsẽj̃ɲ.tɾa.ˈsãw̃"
ipa_pt("tipos", narrow = T)
#> [1] "ˈt͡ʃi.pʊs"
ipa_pt("quiséssemos", narrow = T)
#> [1] "ki.ˈzɛ.se.mʊs"
ipa_pt("parangaricutirrimirruaro", narrow = T)
#> [1] "pa.ˌɾãŋ.ga.ˌɾi.ku.ˌt͡ʃi.xi.ˌmi.xu.ˈa.ɾʊ"
```

- **Not vectorized** (i.e., serial): ideal for *one* input
- <h-l>Advantage</h-l>: probabilistic stress assignment, useful for nonce words

---

## Example 3: transcription *en masse*

- <h-l>Crucial</h-l>: being able to transcribe a lot of words *quickly*
- `ipa(...)`: vectorized (<h-l>Portuguese and Spanish</h-l>)

```r
ipa(word = c("Example", "com", "múltiplas", "palavras"))
#> [1] "e.ˈzam.ple"   "ˈkom"         "ˈmul.ti.plas" "pa.ˈla.vɾas"
```

- Narrow transcription also available (for Portuguese only):

```r
ipa(word = c("Encontramos", "transcrição", 
             "fonética", "fina", "também"), 
    narrow = T)
#> [1] "ˌẽj̃ɲ.kõn.ˈtɾã.mʊs" "ˌtɾãns.kɾi.ˈsãw̃"   "fo.ˈnɛ.t͡ʃi.kɐ"    
#> [4] "ˈfĩ.nɐ"            "tãm.ˈbẽj̃ɲ"
```

- **Vectorized** function (i.e., parallel): ideal for a lot of data
- <h-l>Advantage</h-l>: speed (*but* stress is assigned **categorically**)

---

## Example 4: short text 💬

- `ipa()` requires tokenized inputs
- What if our input is a text...?
- <mark>`cleanText()`</mark>: data cleaning and tokenization

```r
library(tidyverse)
# Sample sentence in Portuguese with some weird stuff in it:
text = "Este é um teXto 123# bastante cUrto que Não está tokenizado"

text |> 
* cleanText() |>
  ipa()
#>  [1] "ˈes.te"          "ˈɛ"              "ˈum"             "ˈtes.to"        
#>  [5] "bas.ˈtan.te"     "ˈkuɾ.to"         "ˈke"             "ˈnãw̃"           
#>  [9] "es.ˈta"          "to.ke.ni.ˈza.do"
```
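
For intuition, cleaning and tokenization can be sketched in base R. This is an illustration only, not the implementation of `cleanText()`; the function name `clean_tokens()` is hypothetical:

```r
# Hypothetical helper: lowercase, drop non-letters, split on whitespace
clean_tokens <- function(x) {
  x <- tolower(x)
  # Replace anything that is neither a letter nor whitespace with a space
  x <- gsub("[^\\p{L}\\s]", " ", x, perl = TRUE)
  tokens <- unlist(strsplit(x, "\\s+", perl = TRUE))
  tokens[nzchar(tokens)]  # remove empty strings left by leading spaces
}

clean_tokens("teXto 123# cUrto")
#> [1] "texto" "curto"
```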

---

## Example 5: from short text to tibble 💬

- We often work with *data frames* or *tibbles*

```r
text = "Este é um teXto 123# bastante cUrto que Não está tokenizado"

d = tibble(word = text |> cleanText()) |> # Words
  mutate(ipa = word |> ipa()) # IPA
```

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> word </th>
   <th style="text-align:left;"> ipa </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> este </td>
   <td style="text-align:left;"> ˈes.te </td>
  </tr>
  <tr>
   <td style="text-align:left;"> é </td>
   <td style="text-align:left;"> ˈɛ </td>
  </tr>
  <tr>
   <td style="text-align:left;"> um </td>
   <td style="text-align:left;"> ˈum </td>
  </tr>
  <tr>
   <td style="text-align:left;"> texto </td>
   <td style="text-align:left;"> ˈtes.to </td>
  </tr>
  <tr>
   <td style="text-align:left;"> bastante </td>
   <td style="text-align:left;"> bas.ˈtan.te </td>
  </tr>
  <tr>
   <td style="text-align:left;"> curto </td>
   <td style="text-align:left;"> ˈkuɾ.to </td>
  </tr>
</tbody>
</table>

---

## Example 6: from long text to tibble 📚

### Task

1. Import *Os Lusíadas*, clean and tokenize text
2. Transcribe, syllabify, and assign stress to lexical words
3. Extract stress and final syllable
4. Extract final syllable constituents

- `getStress()`: extracts stress from transcription
- `getWeight()`: extracts weight profile (e.g., `LLH`)
- `getSyl()`: extracts a given syllable
- `syllable()`: extracts syllabic constituents
- `stopwords_pt` and `stopwords_sp`: stopwords (adapted from `tm` package)
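
To see what stress and final-syllable extraction involve, here is a base R sketch that reads both off a syllabified transcription. The function names and the `"antepenult"` label are assumptions for illustration; this is not how `getStress()` or `getSyl()` are implemented:

```r
stress_mark <- "\u02c8"  # IPA primary stress mark

# Hypothetical helper: stress position counted from the right edge
stress_position <- function(ipa) {
  vapply(ipa, function(w) {
    syls <- strsplit(w, ".", fixed = TRUE)[[1]]
    stressed <- which(grepl(stress_mark, syls, fixed = TRUE))
    if (length(stressed) == 0) return(NA_character_)
    from_right <- length(syls) - stressed[1] + 1
    c("final", "penult", "antepenult")[from_right]  # NA beyond 3 syllables
  }, character(1), USE.NAMES = FALSE)
}

# Hypothetical helper: last syllable, without the stress mark
final_syllable <- function(ipa) {
  vapply(ipa, function(w) {
    syls <- strsplit(w, ".", fixed = TRUE)[[1]]
    sub(stress_mark, "", syls[length(syls)], fixed = TRUE)
  }, character(1), USE.NAMES = FALSE)
}

x <- c("\u02c8a\u027e.mas", "o.si.den.\u02c8tal")  # armas, ocidental
stress_position(x)
#> [1] "penult" "final"
final_syllable(x)
#> [1] "mas" "tal"
```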

---

## Example 6: from long text to tibble 📚

.panelset[
.panel[.panel-name[Code]

```r
lus1 = read_lines("lusiadas.txt")

lus2 = lus1 |> 
  cleanText() |>                                         # data cleaning + tokenization
  as_tibble() |> 
  rename(word = value) |> 
  filter(!word %in% stopwords_pt) |>                     # stopword removal
  mutate(ipa = ipa(word),                                # column for transcription
         stress = getStress(ipa),                        # column for stress
         weight = getWeight(ipa),                        # column for weight
         finSyl = getSyl(word = ipa, pos = 1),           # column for final syllable
         onsetFin = syllable(finSyl, const = "onset"),
         nucFin = syllable(finSyl, const = "nucleus"),
         codaFin = syllable(finSyl, const = "coda"),
         rhFin = syllable(finSyl, const = "rhyme"))
```

]

.panel[.panel-name[Result]

- Total number of lexical words: 32,618 (⏳ **< 2s**)
- *Tidy data* format ready for analysis

<br>

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> word </th>
   <th style="text-align:left;"> ipa </th>
   <th style="text-align:left;"> stress </th>
   <th style="text-align:left;"> weight </th>
   <th style="text-align:left;"> finSyl </th>
   <th style="text-align:left;"> onsetFin </th>
   <th style="text-align:left;"> nucFin </th>
   <th style="text-align:left;"> codaFin </th>
   <th style="text-align:left;"> rhFin </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> armas </td>
   <td style="text-align:left;"> ˈaɾ.mas </td>
   <td style="text-align:left;"> penult </td>
   <td style="text-align:left;"> HH </td>
   <td style="text-align:left;"> mas </td>
   <td style="text-align:left;"> m </td>
   <td style="text-align:left;"> a </td>
   <td style="text-align:left;"> s </td>
   <td style="text-align:left;"> as </td>
  </tr>
  <tr>
   <td style="text-align:left;"> barões </td>
   <td style="text-align:left;"> ba.ˈɾõj̃s </td>
   <td style="text-align:left;"> final </td>
   <td style="text-align:left;"> LH </td>
   <td style="text-align:left;"> ɾõj̃s </td>
   <td style="text-align:left;"> ɾ </td>
   <td style="text-align:left;"> õj̃ </td>
   <td style="text-align:left;"> s </td>
   <td style="text-align:left;"> õj̃s </td>
  </tr>
  <tr>
   <td style="text-align:left;"> assinalados </td>
   <td style="text-align:left;"> a.si.na.ˈla.dos </td>
   <td style="text-align:left;"> penult </td>
   <td style="text-align:left;"> LLH </td>
   <td style="text-align:left;"> dos </td>
   <td style="text-align:left;"> d </td>
   <td style="text-align:left;"> o </td>
   <td style="text-align:left;"> s </td>
   <td style="text-align:left;"> os </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ocidental </td>
   <td style="text-align:left;"> o.si.den.ˈtal </td>
   <td style="text-align:left;"> final </td>
   <td style="text-align:left;"> LHH </td>
   <td style="text-align:left;"> tal </td>
   <td style="text-align:left;"> t </td>
   <td style="text-align:left;"> a </td>
   <td style="text-align:left;"> l </td>
   <td style="text-align:left;"> al </td>
  </tr>
  <tr>
   <td style="text-align:left;"> praia </td>
   <td style="text-align:left;"> ˈpɾa.ja </td>
   <td style="text-align:left;"> penult </td>
   <td style="text-align:left;"> LL </td>
   <td style="text-align:left;"> ja </td>
   <td style="text-align:left;"> NA </td>
   <td style="text-align:left;"> ja </td>
   <td style="text-align:left;"> NA </td>
   <td style="text-align:left;"> ja </td>
  </tr>
</tbody>
</table>
]

.panel[.panel-name[Glides?]

- Are glides onsets, codas, or are they in the nucleus...❓ 
- The function `syllable()` allows us to adjust our assumptions:

<br>

]
]

---

## Example 7: sonority 🔉

- `demi(word = ..., d = ...)`: extraction of demisyllables (`d = 1` or `d = 2`)

```r
syllables = c("kom", "sil", "fran", "klas")

syllables |> 
  demi(d = 1) # get first demisyllable
#> [1] "ko"  "si"  "fra" "kla"
```
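
As a rough illustration of the idea (not the implementation of `demi()`), a demisyllable can be cut at the first vowel. The sketch below assumes a single-vowel nucleus and a hypothetical five-vowel inventory:

```r
# Hypothetical helper: d = 1 gives onset + nucleus, d = 2 gives nucleus + coda
demi_sketch <- function(syl, d = 1, vowels = c("a", "e", "i", "o", "u")) {
  vapply(strsplit(syl, ""), function(seg) {
    v <- which(seg %in% vowels)[1]       # position of the first vowel
    if (is.na(v)) return(NA_character_)  # no vowel found
    idx <- if (d == 1) seq_len(v) else v:length(seg)
    paste(seg[idx], collapse = "")
  }, character(1))
}

demi_sketch(c("kom", "sil", "fran", "klas"), d = 1)
#> [1] "ko"  "si"  "fra" "kla"
demi_sketch(c("kom", "sil", "fran", "klas"), d = 2)
#> [1] "om" "il" "an" "as"
```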

- We can also calculate the average sonority dispersion of a vector with `meanSonDisp()`:

```r
syllables |> 
  demi(d = 1) |> 
  meanSonDisp()
#> [1] 2.67
```

- **Note**: The function assumes 17 levels of sonority (see Parker 2011)<sup>1</sup>

.footnote[[1] Parker, S. (2011). Sonority. In M. van Oostendorp, C. J. Ewen, E. Hume, & K. Rice (Eds.), *The Blackwell companion to phonology* (pp. 1160–1184). Wiley Online Library. https://doi.org/10.1002/9781444335262.wbctp0049]

---

## Example 8: sonority 🔊

- When teaching phonology, it may be useful to visualize sonority levels:

.pull-left[

```r
"combradol" |> 
  ipa() |> 
* plotSon(syl = F)
```

<img src="index_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />
]

.pull-right[

```r
"sobremesa" |> 
  ipa(lg = "Spanish") |> 
* plotSon(syl = T)
```

<img src="index_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />
]

---

## Example 9: vowel trapezoids

- `plotVowels()` generates trapezoids for some languages
- It can also export a <mark>`tex`</mark> file for `\(\LaTeX\)` users

.pull-left[

```r
plotVowels(lg = "Spanish", 
*          tex = F)
```

<img src="index_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />
]

.pull-right[

```r
plotVowels(lg = "Italian", 
*          tex = F)
```

<img src="index_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />
]

---

## Example 10: natural classes and distinctive features

- From phonemes to features `\(\rightarrow\)` `getFeat()`:

```r
getFeat(ph = c("i", "u"), lg = "English")
#> [1] "+hi"    "+tense"
getFeat(ph = c("p", "b"), lg = "French")
#> [1] "-son"  "-cont" "+lab"
getFeat(ph = c("i", "y", "u"), lg = "French")
#> [1] "+syl" "+hi"
```

- From features to phonemes `\(\rightarrow\)` `getPhon()`:

```r
getPhon(ft = c("+syl", "+hi"), lg = "French")
#> [1] "u" "i" "y"
getPhon(ft = c("-DR", "-cont", "-son"), lg = "English")
#> [1] "t" "d" "b" "k" "g" "p"
getPhon(ft = c("-son", "+vce"), lg = "Spanish")
#> [1] "z" "d" "b" "ʝ" "g" "v"
```
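
The logic behind both directions can be sketched with set operations on a toy feature table. The table, its feature values, and the function names below are illustrative assumptions; the package's own feature matrices and any minimization step may differ:

```r
# Toy feature table (illustrative values only)
feats <- list(
  i = c("+syl", "+hi", "-back", "-round"),
  y = c("+syl", "+hi", "-back", "+round"),
  u = c("+syl", "+hi", "+back", "+round"),
  a = c("+syl", "-hi", "+back", "-round")
)

# Phonemes to features: what do all the given phonemes share?
shared_features <- function(ph, table = feats) {
  Reduce(intersect, table[ph])
}

# Features to phonemes: which phonemes carry all the given features?
phones_with <- function(ft, table = feats) {
  keep <- vapply(table, function(f) all(ft %in% f), logical(1))
  names(table)[keep]
}

shared_features(c("i", "y", "u"))
#> [1] "+syl" "+hi"
phones_with(c("+syl", "+hi"))
#> [1] "i" "y" "u"
```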

---

## Example 11: word generator and phonotactic probability 🎲

- `wug_pt()` creates possible words in Portuguese

```r
set.seed(1)
wug_pt(profile = "LHL")
#> [1] "dɾa.ˈbuɾ.me"
```

- Let's create a tibble with 8 new words + their phonotactic probability with <mark>`biGram_pt()`</mark>:

```r
set.seed(1)
gen = tibble(word = wug_pt("LHL", n = 8)) |> 
  mutate(bigram = word |> 
*          biGram_pt()
  ) 
```
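
For intuition about where such scores come from, a log bigram probability can be sketched in base R with a toy lexicon and `#` as the word boundary. This assumes a simple joint-probability estimate; how `biGram_pt()` estimates probabilities from PSL may differ:

```r
# Split a word into bigrams, padding with word boundaries
bigrams <- function(w) {
  ch <- c("#", strsplit(w, "")[[1]], "#")
  paste0(head(ch, -1), tail(ch, -1))
}

# Toy lexicon (hypothetical), used to estimate bigram probabilities
lexicon <- c("pata", "tapa", "pato", "mapa")
counts <- table(unlist(lapply(lexicon, bigrams)))
probs <- counts / sum(counts)

# Sum of log probabilities; unseen bigrams make the word impossible (-Inf)
log_bigram <- function(w) {
  bg <- bigrams(w)
  if (!all(bg %in% names(probs))) return(-Inf)
  sum(log(as.numeric(probs[bg])))
}

bigrams("pa")
#> [1] "#p" "pa" "a#"
log_bigram("pata") > log_bigram("mapo")  # "po" never occurs in the lexicon
#> [1] TRUE
```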

---

## Example 11: word generator and phonotactic probability 🎲

.pull-left[
- Lower bigram values `\(\rightarrow\)` less probable

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> word </th>
   <th style="text-align:right;"> bigram </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> la.ˈxus.te </td>
   <td style="text-align:right;"> -45.74615 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> xo.ˈman.xo </td>
   <td style="text-align:right;"> -46.32619 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> be.ˈʒan.tɾe </td>
   <td style="text-align:right;"> -49.19741 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> dɾa.ˈbuɾ.me </td>
   <td style="text-align:right;"> -49.23458 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ze.ˈfɾan.ka </td>
   <td style="text-align:right;"> -50.74279 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ʒa.ˈgɾan.fe </td>
   <td style="text-align:right;"> -51.86230 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ʃe.ˈfoɾ.bɾe </td>
   <td style="text-align:right;"> -55.46661 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> me.ˈxes.vɾo </td>
   <td style="text-align:right;"> -68.84952 </td>
  </tr>
</tbody>
</table>
]

.pull-right[
- Bigrams are calculated based on [PSL](https://gdgarcia.ca/psl)
- Access the simplified version of the lexicon with `pt_lex`
- Or the whole lexicon with `psl`

```r
set.seed(1)
pt_lex |> sample_n(5)
#> # A tibble: 5 × 2
#>   word             pro                   
#>   <fct>            <chr>                 
#> 1 brocador         bɾo.ka.ˈdoɾ           
#> 2 flagelariáceo    fla.ʒe.la.ɾi.ˈa.sjo   
#> 3 ultratumular     ul.tɾa.tu.mu.ˈlaɾ     
#> 4 desencrencamento de.zen.kɾen.ka.ˈmen.to
#> 5 hulheira         u.ˈʎej.ɾa
```
]

---

## Example 12: Listing bigrams 🎲

.pull-left[

```r
lus_bigramas = lus1 |> 
  cleanText() |> 
  ipa() |> 
* nGramTbl(n = 2)
```

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> nGrams </th>
   <th style="text-align:left;"> n1 </th>
   <th style="text-align:left;"> n2 </th>
   <th style="text-align:right;"> freq </th>
   <th style="text-align:right;"> prop </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> e# </td>
   <td style="text-align:left;"> e </td>
   <td style="text-align:left;"> # </td>
   <td style="text-align:right;"> 12164 </td>
   <td style="text-align:right;"> 0.0348513 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> a# </td>
   <td style="text-align:left;"> a </td>
   <td style="text-align:left;"> # </td>
   <td style="text-align:right;"> 11556 </td>
   <td style="text-align:right;"> 0.0331093 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> o# </td>
   <td style="text-align:left;"> o </td>
   <td style="text-align:left;"> # </td>
   <td style="text-align:right;"> 10818 </td>
   <td style="text-align:right;"> 0.0309948 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> s# </td>
   <td style="text-align:left;"> s </td>
   <td style="text-align:left;"> # </td>
   <td style="text-align:right;"> 10243 </td>
   <td style="text-align:right;"> 0.0293474 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> #k </td>
   <td style="text-align:left;"> # </td>
   <td style="text-align:left;"> k </td>
   <td style="text-align:right;"> 6849 </td>
   <td style="text-align:right;"> 0.0196232 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> #a </td>
   <td style="text-align:left;"> # </td>
   <td style="text-align:left;"> a </td>
   <td style="text-align:right;"> 6126 </td>
   <td style="text-align:right;"> 0.0175517 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> #d </td>
   <td style="text-align:left;"> # </td>
   <td style="text-align:left;"> d </td>
   <td style="text-align:right;"> 5361 </td>
   <td style="text-align:right;"> 0.0153599 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> #e </td>
   <td style="text-align:left;"> # </td>
   <td style="text-align:left;"> e </td>
   <td style="text-align:right;"> 5140 </td>
   <td style="text-align:right;"> 0.0147267 </td>
  </tr>
</tbody>
</table>
]

.pull-right[
- <mark>`nGramTbl()`</mark> `\(\rightarrow\)` all bigrams
- Now it's easy to plot patterns

### Visualizing bigrams

- Together with `nGramTbl()`:
- <mark>`plotnGrams()`</mark> `\(\rightarrow\)` figures with `ggplot2`

- Two options:
1. `type = "heat"`
2. `type = "lollipop"`

- In both, we define the number of bigrams `n`
]
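
A table with the same columns (`nGrams`, `n1`, `n2`, `freq`, `prop`) can be approximated in base R. This is a sketch on two toy words, not the implementation of `nGramTbl()`:

```r
words <- c("pata", "tapa")  # toy input (hypothetical)

# All bigrams, with # marking word edges
bg <- unlist(lapply(words, function(w) {
  ch <- c("#", strsplit(w, "")[[1]], "#")
  paste0(head(ch, -1), tail(ch, -1))
}))

# Count, split into the two segments, and add proportions
tbl <- as.data.frame(table(nGrams = bg), responseName = "freq",
                     stringsAsFactors = FALSE)
tbl$n1 <- substr(tbl$nGrams, 1, 1)
tbl$n2 <- substr(tbl$nGrams, 2, 2)
tbl$prop <- tbl$freq / sum(tbl$freq)
tbl <- tbl[order(-tbl$freq), c("nGrams", "n1", "n2", "freq", "prop")]
tbl
```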

---

## Example 12: Visualizing bigrams 🎲

.pull-left[

```r
lus_bigramas |> 
* plotnGrams(type = "lollipop", n = 10)
```

<img src="index_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />
]

.pull-right[

```r
lus_bigramas |> 
* plotnGrams(type = "heat", n = 50)
```

<img src="index_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />
]

---

## Example 12: Visualizing bigrams 🎲

- Network with top 50 bigrams from *Os Lusíadas* with `networkD3` (excluding `#`)

<br>

<div class="iframe-container">
<iframe src="my_network.html" height="450" width="70%" class="d3-figure"></iframe>
</div>

---

## Example 13: vowel plot 🗣️

- A simple *wrapper* for F1/F2 in `ggplot2`. Example with `vowels` (hypothetical values):

```r
ggplot(data = vowels, aes(x = F2, y = F1, color = vowel, label = vowel)) +
  geom_text() +
  theme_classic() + theme(legend.position = "none")
```

---

## Example 13: vowel plot 🗣️

- A simple *wrapper* for F1/F2 in `ggplot2`. Example with `vowels` (hypothetical values):

```r
ggplot(data = vowels, aes(x = F2, y = F1, color = vowel, label = vowel)) +
  geom_text() +
  theme_classic() + theme(legend.position = "none") +
* formants()
```
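
The slide does not show what `formants()` adds. Vowel charts conventionally reverse both axes so that high front vowels sit at the top left; the sketch below does that with plain `ggplot2` scales and hypothetical formant values. It is an assumption about the general technique, not the function's implementation:

```r
library(ggplot2)

# Hypothetical formant values (Hz), for illustration only
vowels <- data.frame(
  vowel = c("i", "e", "a", "o", "u"),
  F1 = c(300, 450, 750, 480, 320),
  F2 = c(2300, 2000, 1300, 900, 800)
)

p <- ggplot(data = vowels, aes(x = F2, y = F1, color = vowel, label = vowel)) +
  geom_text() +
  theme_classic() + theme(legend.position = "none") +
  scale_x_reverse() +  # front vowels (high F2) on the left
  scale_y_reverse()    # high vowels (low F1) at the top
p
```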

---

## Example 14: from IPA to TIPA ✏️

- If you use `\(\LaTeX\)`, `tipa` is essential
- The function `ipa2tipa()` translates an IPA sequence into `tipa`:

```r
"Aqui estão algumas palavras que desejo transcrever" |> 
  cleanText() |> 
  ipa(narrow = T) |> 
  ipa2tipa(pre = "[ ", post = " ]")
#> Done! Here's your tex code using TIPA:
#> \textipa{ [ a."ki es."t\~{a}\~{w} aw."g\~{u}.m5s pa."la.vR5s "ke de."ze.ZU {""}tR\~{a}ns.kRe."veR ] }
```
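
The conversion amounts to a symbol-by-symbol lookup. A minimal sketch with five standard `tipa` shortcuts that also appear in the output above; the map is deliberately incomplete and the function name `to_tipa()` is hypothetical, not the implementation of `ipa2tipa()`:

```r
# IPA symbol -> tipa shortcut (small illustrative subset)
tipa_map <- setNames(
  c("\"", "R", "5", "Z", "U"),
  c("\u02c8",  # primary stress
    "\u027e",  # alveolar tap
    "\u0250",  # near-open central vowel
    "\u0292",  # voiced postalveolar fricative
    "\u028a")  # near-close near-back vowel
)

to_tipa <- function(x) {
  for (k in names(tipa_map)) {
    x <- gsub(k, tipa_map[[k]], x, fixed = TRUE)
  }
  paste0("\\textipa{", x, "}")
}

cat(to_tipa("pa.\u02c8la.v\u027e\u0250s"))
#> \textipa{pa."la.vR5s}
```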

<br>

<div align="center">
<img src="tipa.png" alt="Output tipa" style="width:100%">
</div>

---

# Questions? 😶‍🌫️

```r
ipa("obrigado") |> 
  plotSon(syl = T)
```

- This project has benefitted from the ENVOL program (Université Laval)