The function below, biGram(), calculates the bigram probability for any given word based on a given corpus. The output is the log of the product of the individual probabilities for each segmental bigram (equivalently, the sum of their logs). The function requires two arguments, namely, a word (x) and a corpus/list of words. Some lines in the function below are based on the Portuguese Stress Lexicon.

The function requires tidyverse. To work with n-grams in general, I strongly recommend the excellent ngram package.

biGram = function(x, corpus){
  library(tidyverse)

  # Split the corpus into words and remove hyphens and apostrophes
  words = unlist(str_split(corpus, pattern = " "))
  words = str_replace_all(string = words, pattern = "-", replacement = "")
  words = str_replace_all(string = words, pattern = "'", replacement = "")

  # Split the target word into individual segments
  x1 = str_split(x, pattern = "")[[1]]

  # Word-initial bigram
  bigrams = c()
  bigrams[1] = paste("^", x1[1], sep = "")

  # Adding word-internal bigrams
  # seq_len() yields an empty loop for one-segment words
  for(i in seq_len(length(x1)-1)){
    seq = str_c(x1[i], x1[i+1], sep = "")
    bigrams[length(bigrams)+1] = seq
  }

  # Adding word-final bigram
  bigrams[length(bigrams)+1] = str_c(x1[length(x1)], "$", sep = "")

  # Variable for all probabilities
  probs = c()

  # Each bigram is used as a regex pattern:
  # count(bigram) / count(first symbol of the bigram)
  for(bigram in bigrams){
    probs[length(probs)+1] = sum(str_count(words, bigram)) /
      sum(str_count(words, str_split(bigram, pattern = "")[[1]][1]))
  }

  return(log(prod(probs)))
}
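
To make the calculation concrete, here is a minimal base-R sketch that works through the same ratio, count(bigram) / count(first symbol of the bigram), on a tiny example. It is not the function above. The toy corpus, the target word, the helper count_fixed(), and the literal edge markers < and > (used here in place of the regex anchors ^ and $) are all assumptions made for illustration.

```r
# Toy corpus and target word (hypothetical data, for illustration only)
corpus <- c("casa", "cama", "saco")
word <- "casa"

# Count non-overlapping literal occurrences of `pattern` across all words
count_fixed <- function(pattern, words) {
  hits <- gregexpr(pattern, words, fixed = TRUE)
  sum(lengths(regmatches(words, hits)))
}

# Mark word edges with literal symbols instead of the regex anchors ^ and $
padded <- paste0("<", corpus, ">")
segs <- strsplit(paste0("<", word, ">"), "")[[1]]

# Adjacent segment pairs: "<c" "ca" "as" "sa" "a>"
bigrams <- paste0(head(segs, -1), tail(segs, -1))

# P(bigram) = count(bigram) / count(first symbol of the bigram)
probs <- vapply(bigrams, function(b) {
  count_fixed(b, padded) / count_fixed(substr(b, 1, 1), padded)
}, numeric(1))

# Summing logs is equivalent to log(prod(probs)) and avoids underflow
log_prob <- sum(log(probs))

print(round(probs, 3))
print(log_prob)  # equals log(8/225), about -3.34
```

In this toy corpus the five probabilities are 2/3, 2/3, 1/5, 1, and 2/5, so their product is 8/225. Matching literal strings rather than regular expressions also means that symbols such as . or ? in a transcription would be counted as themselves, which may or may not be what you want for your own data.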
Copyright © 2022 Guilherme Duarte Garcia