The generating function for Smirnov words (or: time until k consecutive results are the same)
1. Alphabet
Suppose we have an alphabet of size
. Its generating function (using the variable
to mark length) is simply
, as
contains
elements of length
each.
2. Words
Let denote the class of all words over the alphabet
. There are many ways to find the generating function
for
.
2.1.
We have
so its generating function is
2.2.
To put it differently, in the symbolic framework, we have , so the generating function for
is
2.3.
We could have arrived at this with direct counting: the number of words of length is
as there are
choices for each of the
letters, so the generating function is
3. Smirnov words
Next, let denote the class of Smirnov words over the alphabet
, defined as words in which no two consecutive letters are identical. (That is, words
in which
for all
, and
for any
.) Again, we can find the generating function for
in different ways.
3.1.
For any word in , by “collapsing” all runs of each letter, we get a Smirnov word. To put it differently, any word in
can be obtained from a Smirnov word
by “expanding” each letter
into a nonempty sequence of that letter. This observation (see Analytic Combinatorics, pp. 204–205) lets us relate the generating functions of
and
as
which implicitly gives the generating function : we have
3.2.
Alternatively, consider in an arbitrary word the first occurrence of a pair of repeated letters. Either this doesn’t happen at all (the word is a Smirnov word), or else, if it happens at position so that
, then the part of the word up to position
is a nonempty Smirnov word, the letter at position
is the same as the previous letter, and everything after
is an arbitrary word. This gives
or in terms of generating functions
giving
3.3.
A minor variant is to again pick an arbitrary word and consider its first pair of repeated letters, happening (if it does) at positions and
, but this time consider the prefix up to
: either it is empty, or the pair of letters is different from the last letter of the prefix, giving us the decomposition
and corresponding generating function
so
which is the same as before after we cancel the factors.
3.4.
We could have arrived at this result with direct counting. For , for a Smirnov word of length
, we have
choices for the first letter, and for each of the other
letters, as they must not be the same as the previous letter, we have
choices. This gives the number of Smirnov words of length
as
for
, and so the generating function for Smirnov words is
again giving
4. Words with bounded runs
We can now generalize. Let denote the class of words in which no letter occurs more than
times consecutively. (
.) We can find the generating function for
.
4.1.
To get a word in we can take a Smirnov word and replace each letter with a nonempty sequence of up to
occurrences of that letter. This gives:
so
4.2.
Pick any arbitrary word, and consider its first occurrence of a run of letters. Either such a run does not exist (which means the word we picked is in
), or it occurs right at the beginning (
possibilities, one for each letter in the alphabet), or, if it occurs starting at position
, then the part of the word up to position
(the “prefix”) is a nonempty Smirnov word, positions
to
are
occurrences of any of the
letters other than the last letter of the prefix, and what follows is an arbitrary word. This gives
or in terms of generating functions
so
giving
4.3.
Arriving at this via direct counting seems hard.
5. Words that stop at a long run
Now consider words in which we “stop” as soon we see consecutive identical letters. Let the class of such words be denoted
(not writing
to keep the notation simple). As before, we can find its generating function in multiple ways.
5.1.
We get any word in by either immediately seeing a run of length
and stopping, or by starting with a nonempty prefix in
, and then stopping with a run of
identical letters different from the last letter of the prefix. Thus we have
and
which gives
5.2.
Alternatively, we can decompose any word by looking for its first run of identical letters. Either it doesn’t occur at all (the word we picked is in
, or the part of the word until the end of the run belongs to
and the rest is an arbitrary word, so
and
so
6. Probability
Finally we arrive at the motivation: suppose we keep appending a random letter from the alphabet, until we encounter the same letter times consecutively. What can we say about the length
of the word thus generated? As all sequences of letters are equally likely, the probability of seeing any string of length
is
. So in the above generating function
, the probability of our word having length
is
, and the probability generating function
is therefore
. This
can be got by replacing
with
in the expression for
: we have
In principle, this probability generating function tells us everything about the distribution of the length of the word. For example, its expected length is
(See this question on Quora for other powerful ways of finding this expected value directly.)
We can also find its variance, as
This variance is really too large to be useful, so what we would really like, is the shape of the distribution… to be continued.
[…] m-letter alphabet, and look for the first occurrence of k consecutive identical digits. I was with some effort able to find that the random variable X, denoting the time until we see k consecutive digits, has […]
Finding where the tail starts for a probability distribution, from its generating function – Math Solution
Sat, 2022-03-05 at 05:45:11