The strangest and least strange French words according to n-grams - Antoine Amarilli's blog - http://a3nm.net/blog/ubac.html
Dec 14, 2014
Amira, imabonehead, Davide in the TARDIS, Hami, John (bird whisperer), Sean McBride, and dāsnake liked this
"The (multi)set of (character-level) n-grams of a word consists of its sequences of n consecutive characters. For instance, the 2-grams of "gram" are "gr", "ra", and "am". Duplicates are counted, e.g., for "toto" the 2-grams are "to", "ot", "to". Given the dictionary of all words in a language, we can compute the multiset of all n-grams."
- Maitani
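To make the quoted definition concrete, here is a minimal Python sketch of character-level n-grams as a multiset. The function names `ngrams` and `ngram_multiset` are made up for illustration and are not taken from the blog post; the multiset is represented with `collections.Counter`.

```python
from collections import Counter


def ngrams(word, n):
    """Return the character-level n-grams of `word` in order, duplicates kept."""
    # A word shorter than n yields an empty list (the range is empty).
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_multiset(words, n):
    """Count every n-gram over a whole word list (hypothetical helper)."""
    counts = Counter()
    for word in words:
        counts.update(ngrams(word, n))
    return counts
```

Calling `ngrams("gram", 2)` gives the three 2-grams from the quote, and `ngrams("toto", 2)` keeps "to" twice, which is why a multiset rather than a plain set is needed.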
"It turns out that this multiset is quite characteristic of the language. For instance, to identify the language of a piece of text, it is often enough to compute its n-grams, normalize it as a frequency distribution, and compare it to the distribution of known languages: usually the closest distribution is that of the language in which the text is written."
- Maitani
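The identification idea in the quote could be sketched roughly as follows. The quote does not say how distributions are compared, so the L1 distance used here is an assumption, and the names `ngram_distribution`, `l1_distance`, and `closest_language` are hypothetical. A real system would build its profiles from full dictionaries or large corpora.

```python
from collections import Counter


def ngram_distribution(words, n=2):
    """Normalize the n-gram counts of a word list into frequencies."""
    counts = Counter()
    for word in words:
        counts.update(word[i:i + n] for i in range(len(word) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def l1_distance(p, q):
    """Sum of absolute frequency differences (an assumed choice of metric)."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))


def closest_language(text_words, profiles, n=2):
    """Return the key of the profile whose distribution is nearest to the text's."""
    text_dist = ngram_distribution(text_words, n)
    return min(profiles, key=lambda lang: l1_distance(text_dist, profiles[lang]))
```

Here `profiles` is a dictionary mapping a language name to a distribution built with `ngram_distribution` from that language's word list, using the same `n` as the text being classified.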
up
- Maitani