The function below, biGram(), calculates the bigram probability of a given word based on a given corpus. The output is the natural log of the product of the individual segmental bigram probabilities (equivalently, the sum of their logs), with the word-initial and word-final boundaries counted as the first and last bigrams. The function requires two arguments: a word (x) and a corpus, given either as a character vector of words or as a string of space-separated words. Some lines in the function below are based on the Portuguese Stress Lexicon. The function requires tidyverse.
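
As a sketch of the calculation (the notation is introduced here for illustration and is not taken from any package): write the word as $x_1 \dots x_n$, let $x_0$ and $x_{n+1}$ stand for the boundary symbols ^ and $, and let $C(\cdot)$ be a count over the corpus, with $C(x_0)$ equal to the number of words. The value returned is then:

$$
\log P(x_1 \dots x_n) = \sum_{i=0}^{n} \log P(x_{i+1} \mid x_i), \qquad P(x_{i+1} \mid x_i) = \frac{C(x_i x_{i+1})}{C(x_i)}
$$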

To work with n-grams in general, I strongly recommend the excellent ngram package.

biGram = function(x, corpus){

library(tidyverse)
words = unlist(str_split(corpus, pattern = " "))
words = str_replace_all(string = words, pattern = "-", replacement = "")
words = str_replace_all(string = words, pattern = "'", replacement = "")

x1 = str_split(x, pattern = "")[[1]]

bigrams = c()
bigrams[1] = paste("^", x1[1], sep = "")

# seq_len() avoids the 1:0 pitfall when x has a single segment
for(i in seq_len(length(x1)-1)){
pair = str_c(x1[i], x1[i+1], sep = "")
bigrams[length(bigrams)+1] = pair
}

bigrams[length(bigrams)+1] = str_c(x1[length(x1)], "$", sep = "")

# Variable for all probabilities
probs = c()

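# Each bigram is used as a regular expression, so "^" and "$" anchor to word edges
for(bigram in bigrams){
# Conditional probability: count of the bigram / count of its first symbol
probs[length(probs)+1] = sum(str_count(words, bigram)) /
sum(str_count(words, str_split(bigram, pattern = "")[[1]][1]))
}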

# Equivalent to log(prod(probs)), but avoids numerical underflow for long words
return(sum(log(probs)))
}
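
To check the calculation by hand, here is a minimal base-R sketch of the same idea on a toy corpus. The name bigram_logprob and the three-word corpus are hypothetical and only serve as illustration. Unlike biGram(), this version pads each word with the boundary symbols and compares literal pairs of symbols instead of using regular expressions, so it also counts overlapping bigrams; on real data the two may therefore differ slightly.

```r
# Hypothetical helper (not part of the original function): a base-R cross-check
bigram_logprob = function(x, corpus){
  # Split the corpus on spaces and drop hyphens and apostrophes, as biGram() does
  words = unlist(strsplit(corpus, " ", fixed = TRUE))
  words = gsub("[-']", "", words)

  # Pad a word with the boundary symbols "^" and "$"
  pad = function(w) paste0("^", w, "$")

  # All adjacent pairs of symbols in a padded word (always at least 2 characters)
  pairs = function(w){
    n = nchar(w)
    substring(w, 1:(n - 1), 2:n)
  }

  corpus_pairs = unlist(lapply(pad(words), pairs))
  target_pairs = pairs(pad(x))

  # P(second | first) = count(first second) / count(pairs starting with first)
  probs = sapply(target_pairs, function(p){
    sum(corpus_pairs == p) / sum(substr(corpus_pairs, 1, 1) == substr(p, 1, 1))
  })

  sum(log(probs))
}

# Toy corpus, assumed for illustration only
toy_corpus = "pato gato pata"
bigram_logprob("pata", toy_corpus)  # log(1/24), about -3.18
```

In this toy case the five bigram probabilities are 2/3 (^p), 1 (pa), 3/4 (at), 1/3 (ta), and 1/4 (a$), whose product is 1/24.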