Fonology

class: title-slide, inverse, center, middle

# Fonology
## an R package for phonological analysis of written data
### Guilherme D. Garcia

#### Université Laval • CRBLM • CRIHN

---

## Introduction

<img src="fonology.png" alt="Fonology" style="width:15%; float: right">
- An R package for extracting phonological variables from written data
- Available at [fr.gdgarcia.ca/fonology](https://fr.gdgarcia.ca/fonology) (under development)

- **This presentation**: a demonstration of the main functions

## Objective

> Automate data preparation for phonological/linguistic analysis

## Areas of application

- **Research**: speed and accuracy
- **Teaching**: interactivity and pedagogy

---

## Installation, etc. 🧐

- Visit [fr.gdgarcia.ca/fonology](https://gdgarcia.ca/fonology)
- To install the package:

```r
library(devtools) # if needed, first run: install.packages("devtools")
install_github("guilhermegarcia/fonology")
```

## Bugs, questions, etc. 🪲

- [github.com/guilhermegarcia/fonology/issues](https://github.com/guilhermegarcia/fonology/issues)

---

## Today 🗺️

### The main functions

1. <h-l>Phonemic transcription</h-l>
--

2. Stress, the syllable, and its constituents
--

3. Sonority
--

4. The vowel trapezoid
--

5. Natural classes
--

6. The word generator + phonotactic probability
--

***

<br>

> *Little can be done without grapheme-to-phoneme conversion*

<br>
- <h-l>Written data</h-l>: easy to find and collect; difficult to analyze
- Phonemic transcription is therefore **essential**

---

class: inverse, center, middle

# The main functions

---

## Example 1: phonemic transcription

- `ipa_pt(...)`: transcription of Portuguese

```r
library(Fonology)

ipa_pt("concentração")
#> [1] "kon.sen.tɾa.ˈsãw̃"
ipa_pt("tipos")
#> [1] "ˈti.pos"
ipa_pt("quiséssemos")
#> [1] "ki.ˈzɛ.se.mos"
ipa_pt("parangaricutirrimirruaro")
#> [1] "pa.ɾan.ga.ɾi.ku.ti.xi.mi.ˈxu.a.ɾo"
```

- **Not vectorized** (i.e., serial): ideal for a *single* word
- <h-l>Advantage</h-l>: probabilistic stress assignment (useful for hypothetical words)

---

## Example 2: phonetic transcription

- `ipa_pt(..., narrow = T)`

```r
ipa_pt("concentração", narrow = T)
#> [1] "kõn.ˌsẽj̃ɲ.tɾa.ˈsãw̃"
ipa_pt("tipos", narrow = T)
#> [1] "ˈt͡ʃi.pʊs"
ipa_pt("quiséssemos", narrow = T)
#> [1] "ki.ˈzɛ.se.mʊs"
ipa_pt("parangaricutirrimirruaro", narrow = T)
#> [1] "pa.ˌɾãŋ.ga.ˌɾi.ku.ˌt͡ʃi.xi.ˌmi.xu.ˈa.ɾʊ"
```

- **Not vectorized** (i.e., serial): ideal for a *single* word
- <h-l>Advantage</h-l>: probabilistic stress assignment (useful for hypothetical words)

---

## Example 3: *bulk* transcription

- <h-l>Crucial</h-l>: being able to transcribe many words **quickly**
- `ipa(...)`: vectorized (<h-l>Portuguese and Spanish</h-l>); French under development

```r
ipa(word = c("Example", "com", "múltiplas", "palavras"))
#> [1] "e.ˈzam.ple"   "ˈkom"         "ˈmul.ti.plas" "pa.ˈla.vɾas"
```

- Phonetic transcription is available for Portuguese:

```r
ipa(word = c("Encontramos", "transcrição", 
             "fonética", "fina", "também"), 
    narrow = T)
#> [1] "ˌẽj̃ɲ.kõn.ˈtɾã.mʊs" "ˌtɾãns.kɾi.ˈsãw̃"   "fo.ˈnɛ.t͡ʃi.kɐ"    
#> [4] "ˈfĩ.nɐ"            "tãm.ˈbẽj̃ɲ"
```

- **Vectorized** function (i.e., parallel): ideal for large numbers of words
- <h-l>Advantage</h-l>: speed (*but* stress is assigned **categorically**)
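
To make the serial/vectorized contrast concrete, here is a minimal base-R sketch. `toy_g2p_one()` and `toy_g2p()` are hypothetical helpers with two toy substitution rules; they are not Fonology's rules or its implementation.

```r
# Hypothetical toy rules, NOT the rules used by Fonology
toy_g2p_one <- function(word) {
  stopifnot(length(word) == 1)          # serial: exactly one word per call
  w <- tolower(word)
  w <- gsub("ch", "ʃ", w, fixed = TRUE)
  gsub("qu", "k", w, fixed = TRUE)
}

# Vectorized: gsub() already operates on a whole character vector
toy_g2p <- function(words) {
  w <- tolower(words)
  w <- gsub("ch", "ʃ", w, fixed = TRUE)
  gsub("qu", "k", w, fixed = TRUE)
}

words <- c("Chave", "quilo", "pato")

# Serial use needs an explicit loop over the words
serial <- vapply(words, toy_g2p_one, character(1), USE.NAMES = FALSE)

# Vectorized use is a single call
vectorized <- toy_g2p(words)

identical(serial, vectorized)
```

Both routes give the same result here; the vectorized version simply avoids one function call per word.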

---

## Example 4: a short text 💬

- `ipa()` requires tokenized input
- What if our input is a text?
- <mark>`cleanText()`</mark>: data cleaning and tokenization

```r
library(tidyverse)
# Example in Portuguese:
text = "Este é um teXto 123# bastante cUrto que Não está tokenizado"

text |> 
* cleanText() |>
  ipa()
#>  [1] "ˈes.te"          "ˈɛ"              "ˈum"             "ˈtes.to"        
#>  [5] "bas.ˈtan.te"     "ˈkuɾ.to"         "ˈke"             "ˈnãw̃"           
#>  [9] "es.ˈta"          "to.ke.ni.ˈza.do"
```
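
For intuition, the cleaning and tokenization step can be sketched in base R. `toy_clean()` is a hypothetical helper, not the code behind `cleanText()`; it assumes a UTF-8 session and simply lowercases, drops anything that is not a letter or whitespace, and splits on whitespace.

```r
# Hypothetical helper, NOT Fonology's cleanText()
toy_clean <- function(text) {
  x <- tolower(text)
  # Keep Unicode letters and whitespace; drop digits, punctuation, symbols
  x <- gsub("[^\\p{L}\\s]", "", x, perl = TRUE)
  x <- trimws(x)
  tokens <- unlist(strsplit(x, "\\s+", perl = TRUE))
  tokens[nzchar(tokens)]                # discard empty tokens
}

toy_clean("Este é um teXto 123# bastante cUrto")
```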

---

## Example 5: from a short text to a *tidy* table 💬

```r
text = "Este é um teXto 123# bastante cUrto que Não está tokenizado"

d = tibble(word = text |> cleanText()) |> # Words
  mutate(ipa = word |> ipa()) # IPA
```

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> word </th>
   <th style="text-align:left;"> ipa </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> este </td>
   <td style="text-align:left;"> ˈes.te </td>
  </tr>
  <tr>
   <td style="text-align:left;"> é </td>
   <td style="text-align:left;"> ˈɛ </td>
  </tr>
  <tr>
   <td style="text-align:left;"> um </td>
   <td style="text-align:left;"> ˈum </td>
  </tr>
  <tr>
   <td style="text-align:left;"> texto </td>
   <td style="text-align:left;"> ˈtes.to </td>
  </tr>
  <tr>
   <td style="text-align:left;"> bastante </td>
   <td style="text-align:left;"> bas.ˈtan.te </td>
  </tr>
  <tr>
   <td style="text-align:left;"> curto </td>
   <td style="text-align:left;"> ˈkuɾ.to </td>
  </tr>
</tbody>
</table>

---

## Example 6: from a long text to a *tidy* table 📚

### Task

1. Import *Os Lusíadas*, then clean and tokenize the text
2. Transcribe, syllabify, and assign stress to the lexical words
3. Extract stress and the final syllable
4. Extract the constituents of that syllable

- `getStress()`: extracts the stress position of a phonemically transcribed word
- `getWeight()`: extracts the syllable weight profile (e.g., `LLH`)
- `getSyl()`: extracts a specific syllable
- `syllable()`: extracts syllable constituents
- `stopwords_pt` and `stopwords_sp`: *stopwords* in Portuguese and Spanish (adapted from the `tm` package)
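
As a rough illustration of what "extracting stress" and "extracting the final syllable" involve, here is a base-R sketch. `toy_stress()` and `toy_final_syl()` are hypothetical helpers, not Fonology's `getStress()` or `getSyl()`; they assume syllables separated by `.` and a single primary stress mark `ˈ`.

```r
# Hypothetical helpers, NOT Fonology's getStress() / getSyl()
toy_stress <- function(ipa) {
  labels <- c("final", "penult", "antepenult")
  vapply(ipa, function(w) {
    syls <- strsplit(w, ".", fixed = TRUE)[[1]]
    hit <- which(grepl("ˈ", syls, fixed = TRUE))   # stressed syllable index
    if (length(hit) != 1) return(NA_character_)
    from_right <- length(syls) - hit + 1           # count from the right edge
    if (from_right > length(labels)) NA_character_ else labels[from_right]
  }, character(1), USE.NAMES = FALSE)
}

toy_final_syl <- function(ipa) {
  vapply(strsplit(ipa, ".", fixed = TRUE), function(s) {
    if (length(s) == 0) return(NA_character_)
    gsub("ˈ", "", s[length(s)], fixed = TRUE)      # drop the stress mark
  }, character(1))
}

words <- c("ˈaɾ.mas", "o.si.den.ˈtal")
toy_stress(words)
toy_final_syl(words)
```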

---

## Example 6: from a long text to a *tibble* 📚

.panelset[
.panel[.panel-name[Code]

```r
lus1 = read_lines("lusiadas.txt")

lus2 = lus1 |> 
  cleanText() |>                                         # cleaning + tokenization
  as_tibble() |> 
  rename(word = value) |> 
  filter(!word %in% stopwords_pt) |>                     # remove stopwords
  mutate(ipa = ipa(word),                                # transcription column
         stress = getStress(ipa),                        # stress column
         weight = getWeight(ipa),                        # syllable weight column
         finSyl = getSyl(word = ipa, pos = 1),           # final syllable column
         onsetFin = syllable(finSyl, const = "onset"),
         nucFin = syllable(finSyl, const = "nucleus"),
         codaFin = syllable(finSyl, const = "coda"),
         rhFin = syllable(finSyl, const = "rhyme"))
```

]

.panel[.panel-name[Result]

- Total number of lexical words: 32,618 (⏳ **< 2s**)
- *Tidy data* format, ready for analysis

<br>

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> word </th>
   <th style="text-align:left;"> ipa </th>
   <th style="text-align:left;"> stress </th>
   <th style="text-align:left;"> weight </th>
   <th style="text-align:left;"> finSyl </th>
   <th style="text-align:left;"> onsetFin </th>
   <th style="text-align:left;"> nucFin </th>
   <th style="text-align:left;"> codaFin </th>
   <th style="text-align:left;"> rhFin </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> armas </td>
   <td style="text-align:left;"> ˈaɾ.mas </td>
   <td style="text-align:left;"> penult </td>
   <td style="text-align:left;"> HH </td>
   <td style="text-align:left;"> mas </td>
   <td style="text-align:left;"> m </td>
   <td style="text-align:left;"> a </td>
   <td style="text-align:left;"> s </td>
   <td style="text-align:left;"> as </td>
  </tr>
  <tr>
   <td style="text-align:left;"> barões </td>
   <td style="text-align:left;"> ba.ˈɾõj̃s </td>
   <td style="text-align:left;"> final </td>
   <td style="text-align:left;"> LH </td>
   <td style="text-align:left;"> ɾõj̃s </td>
   <td style="text-align:left;"> ɾ </td>
   <td style="text-align:left;"> õj̃ </td>
   <td style="text-align:left;"> s </td>
   <td style="text-align:left;"> õj̃s </td>
  </tr>
  <tr>
   <td style="text-align:left;"> assinalados </td>
   <td style="text-align:left;"> a.si.na.ˈla.dos </td>
   <td style="text-align:left;"> penult </td>
   <td style="text-align:left;"> LLH </td>
   <td style="text-align:left;"> dos </td>
   <td style="text-align:left;"> d </td>
   <td style="text-align:left;"> o </td>
   <td style="text-align:left;"> s </td>
   <td style="text-align:left;"> os </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ocidental </td>
   <td style="text-align:left;"> o.si.den.ˈtal </td>
   <td style="text-align:left;"> final </td>
   <td style="text-align:left;"> LHH </td>
   <td style="text-align:left;"> tal </td>
   <td style="text-align:left;"> t </td>
   <td style="text-align:left;"> a </td>
   <td style="text-align:left;"> l </td>
   <td style="text-align:left;"> al </td>
  </tr>
  <tr>
   <td style="text-align:left;"> praia </td>
   <td style="text-align:left;"> ˈpɾa.ja </td>
   <td style="text-align:left;"> penult </td>
   <td style="text-align:left;"> LL </td>
   <td style="text-align:left;"> ja </td>
   <td style="text-align:left;"> NA </td>
   <td style="text-align:left;"> ja </td>
   <td style="text-align:left;"> NA </td>
   <td style="text-align:left;"> ja </td>
  </tr>
</tbody>
</table>
]

.panel[.panel-name[Glides?]

- The status of glides

<br>

]
]

---

## Example 7: sonority 🔉

- `demi(word = ..., d = ...)`: extraction of demisyllables (`d = 1` or `d = 2`)

```r
syllables = c("kom", "sil", "fran", "klas")

syllables |> 
  demi(d = 1) # extract the first demisyllable
#> [1] "ko"  "si"  "fra" "kla"
```
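
A minimal sketch of the idea, under stated assumptions: `toy_demi()` is a hypothetical helper (not Fonology's `demi()`), it assumes a five-vowel inventory `a e i o u` and one vowel per syllable, and it takes the first demisyllable to be onset + nucleus and the second to be nucleus + coda.

```r
# Hypothetical helper, NOT Fonology's demi()
toy_demi <- function(syl, d = 1) {
  stopifnot(d %in% c(1, 2))
  if (d == 1) {
    # Everything up to and including the first vowel
    sub("^([^aeiou]*[aeiou]).*$", "\\1", syl)
  } else {
    # Everything from the first vowel onwards
    sub("^[^aeiou]*", "", syl)
  }
}

syllables <- c("kom", "sil", "fran", "klas")

toy_demi(syllables, d = 1)
#> [1] "ko"  "si"  "fra" "kla"
toy_demi(syllables, d = 2)
#> [1] "om" "il" "an" "as"
```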

- You can also compute the mean sonority dispersion of a vector with `meanSonDisp()`:

```r
syllables |> 
  demi(d = 1) |> 
  meanSonDisp()
#> [1] 2.67
```

- **Note**: the function assumes 17 sonority levels (Parker 2011)<sup>1</sup>

.footnote[[1] Parker, S. (2011). Sonority. In M. van Oostendorp, C. J. Ewen, E. Hume, & K. Rice (Eds.), *The Blackwell companion to phonology* (pp. 1160–1184). Wiley Online Library. https://doi.org/10.1002/9781444335262.wbctp0049]

---

## Example 8: sonority 🔊

- For teaching phonology:

.pull-left[

```r
"combradol" |> 
  ipa() |> 
* plotSon(syl = F)
```

<img src="index_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />
]

.pull-right[

```r
"sobremesa" |> 
  ipa(lg = "Spanish") |> 
* plotSon(syl = T)
```

<img src="index_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />
]

---

## Example 9: vowel trapezoids

- `plotVowels()` generates the vowel trapezoid for a few languages
- The function can export a <mark>`tex`</mark> file for those who use `\(\LaTeX\)`

.pull-left[

```r
plotVowels(lg = "Spanish", 
*          tex = F)
```

<img src="index_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />
]

.pull-right[

```r
plotVowels(lg = "Italian", 
*          tex = F)
```

<img src="index_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />
]

---

## Example 10: natural classes and distinctive features

- From phonemes to distinctive features `\(\rightarrow\)` `getFeat()`:

```r
getFeat(ph = c("i", "u"), lg = "English")
#> [1] "+hi"    "+tense"
getFeat(ph = c("p", "b"), lg = "French")
#> [1] "-son"  "-cont" "+lab"
getFeat(ph = c("i", "y", "u"), lg = "French")
#> [1] "+syl" "+hi"
```

- From distinctive features to phonemes `\(\rightarrow\)` `getPhon()`:

```r
getPhon(ft = c("+syl", "+hi"), lg = "French")
#> [1] "u" "i" "y"
getPhon(ft = c("-DR", "-cont", "-son"), lg = "English")
#> [1] "t" "d" "b" "k" "g" "p"
getPhon(ft = c("-son", "+vce"), lg = "Spanish")
#> [1] "z" "d" "b" "ʝ" "g" "v"
```
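
The two directions can be sketched with a small lookup table. Everything below is hypothetical: `toy_feats` is a deliberately tiny, simplified feature table, and `toy_shared()` / `toy_class()` are not Fonology's `getFeat()` / `getPhon()`. In particular, `toy_shared()` returns every shared feature and makes no attempt to find a minimal description.

```r
# Hypothetical, simplified feature table (NOT Fonology's feature system)
toy_feats <- data.frame(
  phoneme = c("p", "b", "m", "i", "u", "a"),
  syl     = c("-", "-", "-", "+", "+", "+"),
  son     = c("-", "-", "+", "+", "+", "+"),
  vce     = c("-", "+", "+", "+", "+", "+"),
  hi      = c("-", "-", "-", "+", "+", "-"),
  stringsAsFactors = FALSE
)

# Phonemes -> features: values on which all given phonemes agree
toy_shared <- function(ph, tbl = toy_feats) {
  rows <- tbl[tbl$phoneme %in% ph, -1, drop = FALSE]
  same <- vapply(rows, function(col) length(unique(col)) == 1, logical(1))
  paste0(unlist(rows[1, same]), names(rows)[same])
}

# Features -> phonemes: segments matching every value, e.g. c("+syl", "+hi")
toy_class <- function(ft, tbl = toy_feats) {
  keep <- rep(TRUE, nrow(tbl))
  for (f in ft) {
    keep <- keep & tbl[[substring(f, 2)]] == substr(f, 1, 1)
  }
  tbl$phoneme[keep]
}

toy_shared(c("p", "b"))
toy_class(c("+syl", "+hi"))
```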

---

## Example 11: word generator + phonotactic probability 🎲

- `wug_pt()` generates possible words in Portuguese

```r
set.seed(1)
wug_pt(profile = "LHL")
#> [1] "dɾa.ˈbuɾ.me"
```

- A *tibble* with 8 new words + their phonotactic probabilities with <mark>`biGram_pt()`</mark>:

```r
set.seed(1)
gen = tibble(word = wug_pt("LHL", n = 8)) |> 
  mutate(bigram = word |> 
*          biGram_pt()
  ) 
```
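
One common way to score a word phonotactically is to sum the log-probabilities of its bigrams. The sketch below illustrates that idea only: it is an assumption about the general technique, not a description of `biGram_pt()`. The four-word `toy_lexicon` and the helpers are hypothetical, and each character is treated as one segment.

```r
# Hypothetical toy lexicon (NOT the PSL) and helpers (NOT biGram_pt())
toy_lexicon <- c("pa.ta", "pa.to", "ta.pa", "to.pa")

# Segment bigrams of a transcription, with # marking word edges
toy_bigrams <- function(word) {
  w <- gsub(".", "", word, fixed = TRUE)   # drop syllable boundaries
  w <- gsub("ˈ", "", w, fixed = TRUE)      # drop stress marks
  segs <- c("#", strsplit(w, "")[[1]], "#")
  paste0(head(segs, -1), tail(segs, -1))
}

# Relative frequency of each bigram in the toy lexicon
counts <- table(unlist(lapply(toy_lexicon, toy_bigrams)))
bigram_prob <- setNames(as.numeric(counts) / sum(counts), names(counts))

# Sum of log-probabilities; unseen bigrams get probability 0 (-Inf overall)
toy_logprob <- function(word, probs = bigram_prob) {
  p <- unname(probs[toy_bigrams(word)])
  p[is.na(p)] <- 0
  sum(log(p))
}

toy_logprob("pa.ta")   # attested bigrams only: finite, negative
toy_logprob("ka.ta")   # "#k" is unattested in the toy lexicon
```

A real implementation would typically smooth unseen bigrams rather than return `-Inf`.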

---

## Example 11: word generator + phonotactic probability 🎲

.pull-left[
- Bigrams:

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> word </th>
   <th style="text-align:right;"> bigram </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> la.ˈxus.te </td>
   <td style="text-align:right;"> -45.74615 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> xo.ˈman.xo </td>
   <td style="text-align:right;"> -46.32619 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> be.ˈʒan.tɾe </td>
   <td style="text-align:right;"> -49.19741 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> dɾa.ˈbuɾ.me </td>
   <td style="text-align:right;"> -49.23458 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ze.ˈfɾan.ka </td>
   <td style="text-align:right;"> -50.74279 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ʒa.ˈgɾan.fe </td>
   <td style="text-align:right;"> -51.86230 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ʃe.ˈfoɾ.bɾe </td>
   <td style="text-align:right;"> -55.46661 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> me.ˈxes.vɾo </td>
   <td style="text-align:right;"> -68.84952 </td>
  </tr>
</tbody>
</table>
]

.pull-right[
- Bigram probabilities are calculated from the [PSL](https://gdgarcia.ca/psl)
- The simplified version of the lexicon `\(\rightarrow\)` `pt_lex`
- Or the full lexicon `\(\rightarrow\)` `psl`

```r
set.seed(1)
pt_lex |> sample_n(5)
#> # A tibble: 5 × 2
#>   word             pro                   
#>   <fct>            <chr>                 
#> 1 brocador         bɾo.ka.ˈdoɾ           
#> 2 flagelariáceo    fla.ʒe.la.ɾi.ˈa.sjo   
#> 3 ultratumular     ul.tɾa.tu.mu.ˈlaɾ     
#> 4 desencrencamento de.zen.kɾen.ka.ˈmen.to
#> 5 hulheira         u.ˈʎej.ɾa
```
]

---

## Example 12: bigrams 🎲

.pull-left[

```r
lus_bigramas = lus1 |> 
  cleanText() |> 
  ipa() |> 
* nGramTbl(n = 2)
```

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> nGrams </th>
   <th style="text-align:left;"> n1 </th>
   <th style="text-align:left;"> n2 </th>
   <th style="text-align:right;"> freq </th>
   <th style="text-align:right;"> prop </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> e# </td>
   <td style="text-align:left;"> e </td>
   <td style="text-align:left;"> # </td>
   <td style="text-align:right;"> 12164 </td>
   <td style="text-align:right;"> 0.0348513 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> a# </td>
   <td style="text-align:left;"> a </td>
   <td style="text-align:left;"> # </td>
   <td style="text-align:right;"> 11556 </td>
   <td style="text-align:right;"> 0.0331093 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> o# </td>
   <td style="text-align:left;"> o </td>
   <td style="text-align:left;"> # </td>
   <td style="text-align:right;"> 10818 </td>
   <td style="text-align:right;"> 0.0309948 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> s# </td>
   <td style="text-align:left;"> s </td>
   <td style="text-align:left;"> # </td>
   <td style="text-align:right;"> 10243 </td>
   <td style="text-align:right;"> 0.0293474 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> #k </td>
   <td style="text-align:left;"> # </td>
   <td style="text-align:left;"> k </td>
   <td style="text-align:right;"> 6849 </td>
   <td style="text-align:right;"> 0.0196232 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> #a </td>
   <td style="text-align:left;"> # </td>
   <td style="text-align:left;"> a </td>
   <td style="text-align:right;"> 6126 </td>
   <td style="text-align:right;"> 0.0175517 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> #d </td>
   <td style="text-align:left;"> # </td>
   <td style="text-align:left;"> d </td>
   <td style="text-align:right;"> 5361 </td>
   <td style="text-align:right;"> 0.0153599 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> #e </td>
   <td style="text-align:left;"> # </td>
   <td style="text-align:left;"> e </td>
   <td style="text-align:right;"> 5140 </td>
   <td style="text-align:right;"> 0.0147267 </td>
  </tr>
</tbody>
</table>
]

.pull-right[
- <mark>`nGramTbl()`</mark> `\(\rightarrow\)` all bigrams

### Visualizing bigrams

- With the output of `nGramTbl()`:
- <mark>`plotnGrams()`</mark> `\(\rightarrow\)` figures with `ggplot2`

- Two options:
1. `type = "heat"`
2. `type = "lollipop"`

- Number of bigrams to plot = `n`
]

---

## Example 12: visualizing bigrams 🎲

.pull-left[

```r
lus_bigramas |> 
* plotnGrams(type = "lollipop", n = 10)
```

<img src="index_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />
]

.pull-right[

```r
lus_bigramas |> 
* plotnGrams(type = "heat", n = 50)
```

<img src="index_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />
]

---

## Example 12: visualizing bigrams 🎲

- Network of the main bigrams in *Os Lusíadas*, built with `networkD3` (excluding `#`)

<br>

<div class="iframe-container">
<iframe src="my_network.html" height="450" width="70%" class="d3-figure"></iframe>
</div>
---

## Example 13: vowel plot 🗣️

- A simple *wrapper* for F1/F2 in `ggplot2`. Example with `vowels` (hypothetical values):

```r
ggplot(data = vowels, aes(x = F2, y = F1, color = vowel, label = vowel)) +
  geom_text() +
  theme_classic() + theme(legend.position = "none")
```

---

## Example 13: vowel plot 🗣️

- A simple *wrapper*, `formants()`, for visualizing F1/F2 formants with `ggplot2`. Example with `vowels` (hypothetical values):

```r
ggplot(data = vowels, aes(x = F2, y = F1, color = vowel, label = vowel)) +
  geom_text() +
  theme_classic() + theme(legend.position = "none") +
* formants()
```

---

## Example 14: from IPA to the TIPA package ✏️

- If you use `\(\LaTeX\)`, `tipa` is an essential package
- The `ipa2tipa()` function converts an IPA string into a `tipa` string:

```r
"Aqui estão algumas palavras que desejo transcrever" |> 
  cleanText() |> 
  ipa(narrow = T) |> 
  ipa2tipa(pre = "[ ", post = " ]")
#> Done! Here's your tex code using TIPA:
#> \textipa{ [ a."ki es."t\~{a}\~{w} aw."g\~{u}.m5s pa."la.vR5s "ke de."ze.ZU {""}tR\~{a}ns.kRe."veR ] }
```
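
The conversion is essentially a symbol-by-symbol substitution. As a hedged sketch, the hypothetical `toy_ipa2tipa()` below (not Fonology's `ipa2tipa()`) covers only a handful of standard `tipa` shortcuts and ignores diacritics such as nasalization.

```r
# Hypothetical, partial IPA -> tipa mapping (NOT Fonology's ipa2tipa())
toy_tipa_map <- c(
  "ˈ" = "\"",   # primary stress
  "ɾ" = "R", "ɐ" = "5", "ʒ" = "Z", "ʊ" = "U",
  "ʃ" = "S", "ɛ" = "E", "ɔ" = "O", "ŋ" = "N"
)

toy_ipa2tipa <- function(ipa) {
  out <- ipa
  for (sym in names(toy_tipa_map)) {
    out <- gsub(sym, toy_tipa_map[[sym]], out, fixed = TRUE)
  }
  paste0("\\textipa{", out, "}")
}

cat(toy_ipa2tipa("pa.ˈla.vɾɐs"), "\n")
#> \textipa{pa."la.vR5s}
```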

<br>

<div align="center">
<img src="tipa.png" alt="tipa output" style="width:100%">
</div>
---

# Questions? 😶‍🌫️

```r
ipa("obrigado") |> 
  plotSon(syl = T)
```

- This project was supported by the ENVOL program (Université Laval)