Created: May, 2022 Last updated: August 05, 2022
In this tutorial, we will see how regular expressions (Regex) can help us syllabify words in any language. We will use simple examples from English, but you can easily adapt the code to accommodate any phonotactic pattern of interest. Our words will be in orthographic form to keep things simple. Don’t worry: the method is exactly the same, so you can easily adapt it to inputs that are phonetically transcribed (as they should be!).
I assume you already use R and that you may be familiar with regular expressions (if you’re not, see here and see RStudio’s cheat sheet here). We will be syllabifying the three words in the vector below. Obviously, this is just a tiny sample to show you how to get started.
library(tidyverse)

words = c("international", "clandestine", "crestfallen")
Starting point: CV syllables
The easiest way to start syllabifying words is to assume a CV template. This is a simplistic assumption for English, but it makes phonological sense (and it’s easier to code!). Here’s how we can think about this in terms of regular expressions: we want to replace every V with V-. To accomplish that, we need to use capturing groups. It will be clear below how useful capturing groups can be in many tasks.

To replace a given string (i.e., V) with another string (i.e., V-), we will use str_replace_all(), which comes from the stringr package (loaded when you load tidyverse). Notice that the replacement must match the pattern we’re replacing, such that if we’re looking for a, the replacement must be a-. In a nutshell, we’d like to create a “variable” that repeats the input in the output. This can’t be done with simple replacement, of course. In the code blocks below, the variable CV will hold our syllabified outputs.
library(tidyverse)

CV = str_replace_all(string = words,
                     pattern = "([aeiou])",
                     replacement = "\\1-")
CV
## [1] "i-nte-rna-ti-o-na-l" "cla-nde-sti-ne-" "cre-stfa-lle-n"
The pattern in the code above, ([aeiou]), is a capturing group because it’s in parentheses. Inside the group, we have [aeiou]. Square brackets simply mean “any of the characters inside should be matched”. As a result, we’re looking for any (orthographic) vowel. We will replace this vowel (whichever vowel we find) with the same vowel + a hyphen, hence \\1-. Number 1 here simply refers back to our group; since there’s only one group, 1 is the only number we need.
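To see what the back-reference buys us, here is a minimal sketch on a toy word. The word potato and the variable names toy, fixed, and grouped are illustrative assumptions, not part of the tutorial’s data.

```r
library(stringr)

toy = "potato"  # hypothetical example word

# With a fixed pattern and a fixed replacement, we can only
# target one specific vowel at a time:
fixed = str_replace_all(toy, pattern = "a", replacement = "a-")

# With a capturing group, \\1 repeats whichever vowel was matched:
grouped = str_replace_all(toy, pattern = "([aeiou])", replacement = "\\1-")

fixed    # "pota-to"
grouped  # "po-ta-to-"
```

The second call covers all five vowels with a single pattern, which is why the tutorial relies on the group rather than on one replacement per vowel.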
Now that we have syllables in our CV variable, it’s time to make the necessary (language-specific) adjustments. We can start by fixing the endings of our syllabified entries. First, we need to remove word-final hyphens, as in cla-nde-sti-ne-. Second, we need to attach a stranded word-final consonant to the preceding syllable, as in i-nte-rna-ti-o-na-l and cre-stfa-lle-n.
# Remove hyphen at the right edge of the word:
CV = CV %>%
  str_remove_all(pattern = "-$")

# Replace -C# with C:
CV = CV %>%
  str_replace_all(pattern = "-([bcdfghjklmnpqrstvxz]$)",
                  replacement = "\\1")
CV
## [1] "i-nte-rna-ti-o-nal" "cla-nde-sti-ne" "cre-stfa-llen"
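The second pattern above only reattaches a single word-final consonant. As a hedged sketch of one possible extension, the code below adds a + quantifier so that a word-final consonant cluster is reattached as well. The word contrast and the names consonants, toy, single, and cluster are assumptions introduced for illustration.

```r
library(stringr)

consonants = "[bcdfghjklmnpqrstvxz]"  # same consonant set as in the code above

# Hypothetical extra word that ends in a consonant cluster:
toy = str_replace_all("contrast", pattern = "([aeiou])", replacement = "\\1-")
toy  # "co-ntra-st"

# The single-consonant pattern does not match "-st" at the word edge:
single = str_replace_all(toy,
                         pattern = str_c("-(", consonants, "$)"),
                         replacement = "\\1")

# Adding + (one or more consonants) reattaches the whole cluster:
cluster = str_replace_all(toy,
                          pattern = str_c("-(", consonants, "+$)"),
                          replacement = "\\1")

single   # "co-ntra-st"
cluster  # "co-ntrast"
```

Whether a final cluster should always be reattached is a language-specific decision, so this is one option rather than a rule.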
Next, we need to fix our illicit onsets: -nt, -rn, -nd, and -stf. Let’s ignore ll in crestfallen since we’re dealing with orthography anyway. Notice that all four problematic clusters can be split into two parts to become licit onsets or codas in English: n-t, r-n, n-d, and st-f. In other words, by moving the hyphen, we fix the issue. The key here is to treat st as a single group (yet another advantage of using capturing groups).
Below, we split the clusters into two groups: (1) (n|r|st) and (2) (t|n|d|f). Notice that any combination of these two groups will result in an illicit onset cluster in English (e.g., -rf). We basically want to go from -(1)(2) to (1)-(2). That’s exactly what we do here (using double backslashes before each group number).
# Fix onsets:
CV = CV %>%
  str_replace_all(pattern = "-(n|r|st)(t|n|d|f)",
                  replacement = "\\1-\\2")
CV
## [1] "in-ter-na-ti-o-nal" "clan-de-sti-ne" "crest-fa-llen"
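If you plan to adapt the two groups to another language, it may be easier to keep them in vectors and build the pattern from those. The sketch below assumes hypothetical names (group1, group2, onset_pattern, before, after) and reproduces the onset fix on the intermediate forms shown earlier.

```r
library(stringr)

# Hypothetical vectors holding the two halves of the illicit clusters:
group1 = c("n", "r", "st")
group2 = c("t", "n", "d", "f")

# Collapse each vector into an alternation and wrap it in parentheses:
onset_pattern = str_c("-(", str_c(group1, collapse = "|"), ")",
                      "(", str_c(group2, collapse = "|"), ")")
onset_pattern  # "-(n|r|st)(t|n|d|f)"

# Syllabified words before the onset fix:
before = c("i-nte-rna-ti-o-nal", "cla-nde-sti-ne", "cre-stfa-llen")

# Move the hyphen from before group 1 to between groups 1 and 2:
after = str_replace_all(before, pattern = onset_pattern,
                        replacement = "\\1-\\2")
after  # "in-ter-na-ti-o-nal" "clan-de-sti-ne" "crest-fa-llen"
```

With this layout, changing the phonotactic pattern only means editing the two vectors.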
Finally, we may want to improve the syllabification of clandestine. More specifically, we want the following syllabification: clan-des-ti-ne (which is exactly how you’d syllabify this word in a language like Portuguese). To do that, we want every s in a cluster that follows an open syllable to become the coda of said syllable. That’s what the line of code below does. Here, we’re assuming that the clusters st, sp, sn, sm, and sl should be broken into coda-onset sequences (if they follow an open syllable). Naturally, this rule can/should be improved. The point here is to show how we can easily accomplish these substitutions using regular expressions.
# Fix "clandestine":
CV = CV %>%
  str_replace_all(pattern = "([aeiou])-s([tpnml])",
                  replacement = "\\1s-\\2")
CV
## [1] "in-ter-na-ti-o-nal" "clan-des-ti-ne" "crest-fa-llen"
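The steps above can also be collected in a single function, which makes it easier to test new words as you refine the rules. This is only a sketch: the name syllabify() is hypothetical, and the function simply chains the same five substitutions in the same order.

```r
library(stringr)

# Hypothetical helper that chains the steps from this tutorial:
syllabify = function(x) {
  # 1. CV template: V -> V-
  x = str_replace_all(x, "([aeiou])", "\\1-")
  # 2. Remove hyphen at the right edge of the word
  x = str_remove_all(x, "-$")
  # 3. Reattach a single word-final consonant: -C# -> C#
  x = str_replace_all(x, "-([bcdfghjklmnpqrstvxz]$)", "\\1")
  # 4. Fix illicit onsets: -(1)(2) -> (1)-(2)
  x = str_replace_all(x, "-(n|r|st)(t|n|d|f)", "\\1-\\2")
  # 5. s becomes the coda of a preceding open syllable
  x = str_replace_all(x, "([aeiou])-s([tpnml])", "\\1s-\\2")
  x
}

result = syllabify(c("international", "clandestine", "crestfallen"))
result  # "in-ter-na-ti-o-nal" "clan-des-ti-ne" "crest-fa-llen"
```

The order of the steps matters, since each substitution operates on the output of the previous one.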
Clearly, working out all the necessary substitutions for an entire language is not an easy task. But the overall structure of our code will follow the same rationale we see above. With regular expressions and capturing groups, manipulating strings is surprisingly straightforward.
Copyright © 2022 Guilherme Duarte Garcia