The Lumber Room

"Consign them to dust and damp by way of preserving them"

The generating function for Smirnov words (or: time until k consecutive results are the same)

1. Alphabet

Suppose we have an alphabet ${\mathcal{A}}$ of size ${m}$. Its generating function (using the variable ${z}$ to mark length) is simply ${A(z) = mz}$, as ${\mathcal{A}}$ contains ${m}$ elements of length ${1}$ each.

2. Words

Let ${\mathcal{W}}$ denote the class of all words over the alphabet ${\mathcal{A}}$. There are many ways to find the generating function ${W(z)}$ for ${\mathcal{W}}$.

2.1.

We have

$\displaystyle \mathcal{W} = \{\epsilon\} + \mathcal{A} + \mathcal{A}\mathcal{A} + \mathcal{A}\mathcal{A}\mathcal{A} + \dots$

so its generating function is

\displaystyle \begin{aligned} W(z) &= 1 + A(z) + A(z)^2 + A(z)^3 + \dots \\ &= 1 + mz + (mz)^2 + (mz)^3 + \dots \\ &= \frac{1}{1-mz} \end{aligned}

2.2.

To put it differently, in the symbolic framework, we have ${\mathcal{W} = \textsc{Seq}(\mathcal{A})}$, so the generating function for ${\mathcal{W}}$ is

$\displaystyle W(z) = \frac{1}{1 - A(z)} = \frac{1}{1-mz}.$

2.3.

We could have arrived at this with direct counting: the number of words of length ${n}$ is ${W_n = m^n}$ as there are ${m}$ choices for each of the ${n}$ letters, so the generating function is

$\displaystyle W(z) = \sum_{n \ge 0}W_n z^n = \sum_{n \ge 0} m^n z^n = \frac{1}{1-mz}.$

3. Smirnov words

Next, let ${\mathcal{S}}$ denote the class of Smirnov words over the alphabet ${\mathcal{A}}$, defined as words in which no two consecutive letters are identical. (That is, words ${w_1w_2 \dots w_n}$ in which ${w_i \in \mathcal{A}}$ for all ${i}$, and ${w_i \neq w_{i-1}}$ for any ${1 < i \le n}$.) Again, we can find the generating function for ${\mathcal{S}}$ in different ways.

3.1.

For any word in ${\mathcal{W}}$, by “collapsing” all runs of each letter, we get a Smirnov word. To put it differently, any word in ${\mathcal{W}}$ can be obtained from a Smirnov word ${w}$ by “expanding” each letter ${w_i}$ into a nonempty sequence of that letter. This observation (see Analytic Combinatorics, pp. 204–205) lets us relate the generating functions of ${\mathcal{W}}$ and ${\mathcal{S}}$ as

$\displaystyle W(z) = S(\frac{z}{1-z})$

which implicitly gives the generating function ${S(z)}$: we have

$\displaystyle S(z) = W(\frac{z}{1+z}) = \frac{1}{1-m\frac{z}{1+z}} = \frac{1+z}{1 - (m-1)z}.$

3.2.

Alternatively, consider in an arbitrary word the first occurrence of a pair of repeated letters. Either this doesn’t happen at all (the word is a Smirnov word), or else, if it happens at position ${i}$ so that ${w_i = w_{i+1}}$, then the part of the word up to position ${i}$ is a nonempty Smirnov word, the letter at position ${i+1}$ is the same as the previous letter, and everything after ${i+1}$ is an arbitrary word. This gives

$\displaystyle \mathcal{W} = \mathcal{S} + (\mathcal{S} \setminus \{ \epsilon \}) \cdot \mathcal{Z} \cdot \mathcal{W}$

or in terms of generating functions

$\displaystyle W(z) = S(z) + (S(z) - 1)zW(z)$

giving

$\displaystyle S(z) = \frac{W(z) (1 + z)}{1 + zW(z)} = \frac{1 + z}{(1-mz)(1 + \frac{z}{1-mz})} = \frac{1+z}{1 - (m-1)z}$

3.3.

A minor variant is to again pick an arbitrary word and consider its first pair of repeated letters, happening (if it does) at positions ${i}$ and ${i+1}$, but this time consider the prefix up to ${i -1}$: either it is empty, or the pair of letters is different from the last letter of the prefix, giving us the decomposition

$\displaystyle \mathcal{W} = \mathcal{S} + m\mathcal{Z}^2 \cdot \mathcal{W} + (\mathcal{S}\setminus \{ \epsilon \}) \cdot (m-1)\mathcal{Z}^2 \mathcal{W}$

and corresponding generating function

$\displaystyle W(z) = S(z) + mz^2W(z) + (S(z) - 1)(m-1)z^2W(z)$

so

$\displaystyle S(z) = \frac{W(z)(1-z^2)}{1 + (m-1)z^2W(z)} = \frac{1-z^2}{1 - mz + (m-1)z^2} = \frac{(1-z)(1+z)}{(1-z)(1 - (m-1)z)}$

which is the same as before after we cancel the ${(1-z)}$ factors.

3.4.

We could have arrived at this result with direct counting. For ${n \ge 1}$, for a Smirnov word of length ${n}$, we have ${m}$ choices for the first letter, and for each of the other ${(n-1)}$ letters, as they must not be the same as the previous letter, we have ${(m-1)}$ choices. This gives the number of Smirnov words of length ${n}$ as ${m (m-1)^{n-1}}$ for ${n \ge 1}$, and so the generating function for Smirnov words is

$\displaystyle S(z) = 1 + \sum_{n \ge 1} m (m-1)^{n-1} z^n = 1 + mz \sum_{n \ge 1} (m-1)^{n-1}z^{n-1} = 1 + \frac{mz}{1-(m-1)z}$

again giving

$\displaystyle S(z) = \frac{1 + z}{1 - (m-1)z}$

4. Words with bounded runs

We can now generalize. Let ${\mathcal{S}_k}$ denote the class of words in which no letter occurs more than ${k}$ times consecutively. (${\mathcal{S} = \mathcal{S}_1}$.) We can find the generating function for ${\mathcal{S}_k}$.

4.1.

To get a word in ${\mathcal{S}}$ we can take a Smirnov word and replace each letter with a nonempty sequence of up to ${k}$ occurrences of that letter. This gives:

$\displaystyle S_k(z) = S(z + z^2 + \dots + z^k) = S(z\frac{1-z^{k}}{1-z})$

so

$\displaystyle S_k(z) = \frac{1 + z\frac{1-z^{k}}{1-z}}{1 - (m-1)z\frac{1-z^{k}}{1-z}} = \frac{1 - z^{k+1}}{1 - mz + (m-1)z^{k+1}}.$

4.2.

Pick any arbitrary word, and consider its first occurrence of a run of ${k+1}$ letters. Either such a run does not exist (which means the word we picked is in ${\mathcal{S}_k}$), or it occurs right at the beginning (${m}$ possibilities, one for each letter in the alphabet), or, if it occurs starting at position ${i > 1}$, then the part of the word up to position ${i-1}$ (the “prefix”) is a nonempty Smirnov word, positions ${i}$ to ${i+k}$ are ${k+1}$ occurrences of any of the ${m-1}$ letters other than the last letter of the prefix, and what follows is an arbitrary word. This gives

$\displaystyle \mathcal{W} = \mathcal{S}_k + m\mathcal{Z}^{k+1} \cdot \mathcal{W} + (\mathcal{S}_k \setminus \{ \epsilon \}) \cdot (m-1)\mathcal{Z}^{k+1} \cdot \mathcal{W}$

or in terms of generating functions

$\displaystyle W(z) = S_k(z) + mz^{k+1}W(z) + (S_k(z) - 1)(m-1)z^{k+1}W(z)$

so

$\displaystyle W(z)(1 - z^{k+1}) = S_k(z) (1 + (m-1)z^{k+1} W(z))$

giving

$\displaystyle S_k(z) = \frac{W(z)(1-z^{k+1})}{1 + (m-1)z^{k+1}W(z)} = \frac{1-z^{k+1}}{1-mz + (m-1)z^{k+1}}$

4.3.

Arriving at this via direct counting seems hard.

5. Words that stop at a long run

Now consider words in which we “stop” as soon we see ${k}$ consecutive identical letters. Let the class of such words be denoted ${\mathcal{U}}$ (not writing ${\mathcal{U}_k}$ to keep the notation simple). As before, we can find its generating function in multiple ways.

5.1.

We get any word in ${\mathcal{U}}$ by either immediately seeing a run of length ${k}$ and stopping, or by starting with a nonempty prefix in ${\mathcal{S}_{k-1}}$, and then stopping with a run of ${k}$ identical letters different from the last letter of the prefix. Thus we have

$\displaystyle \mathcal{U} = m \mathcal{Z}^k + (\mathcal{S}_{k-1} \setminus \{\epsilon\}) \cdot (m-1)\mathcal{Z}^k$

and

$\displaystyle U(z) = m z^k + (S_{k-1}(z) - 1) (m-1) z^k$

which gives

$\displaystyle U(z) = z^k(1 + (m-1)S_{k-1}(z)) = z^k\left(1+(m-1)\frac{1-z^k}{1-mz+(m-1)z^k}\right) = \frac{m(1-z)z^k}{1 - mz + (m-1)z^k}$

5.2.

Alternatively, we can decompose any word by looking for its first run of ${k}$ identical letters. Either it doesn’t occur at all (the word we picked is in ${\mathcal{S}_{k-1}}$, or the part of the word until the end of the run belongs to ${\mathcal{U}}$ and the rest is an arbitrary word, so

$\displaystyle \mathcal{W} = \mathcal{S}_{k-1} + \mathcal{U} \cdot \mathcal{W}$

and

$\displaystyle W(z) = S_{k-1}(z) + U(z) W(z)$

so

$\displaystyle U(z) = 1 - \frac{S_{k-1}(z)}{W(z)} = 1 - \frac{(1-z^k)(1-mz)}{1-mz + (m-1)z^k} = \frac{m(1-z)z^k}{1 - mz + (m-1)z^k}$

6. Probability

Finally we arrive at the motivation: suppose we keep appending a random letter from the alphabet, until we encounter the same letter ${k}$ times consecutively. What can we say about the length ${X}$ of the word thus generated? As all sequences of letters are equally likely, the probability of seeing any string of length ${n}$ is ${\frac{1}{m^n}}$. So in the above generating function ${U(z) = \sum_{n} U_n z^n}$, the probability of our word having length ${n}$ is ${U_n / m^n}$, and the probability generating function ${P(z)}$ is therefore ${\sum_{n} U_n z^n / m^n}$. This ${P(z)}$ can be got by replacing ${z}$ with ${z/m}$ in the expression for ${U(z)}$: we have

$\displaystyle P(z) = U(z/m) = \frac{(m-z)z^k}{m^k(1-z) + (m-1)z^k}$

In principle, this probability generating function tells us everything about the distribution of the length of the word. For example, its expected length is

$\displaystyle \mathop{E}[X] = P'(1) = \frac{m^k - 1}{m - 1}$

(See this question on Quora for other powerful ways of finding this expected value directly.)

We can also find its variance, as

$\displaystyle \mathop{Var}[X] = P''(1) + P'(1) - P'(1)^2 = \frac{m^{2k} - (2k-1)(m-1)m^k - m}{(m-1)^2}$

This variance is really too large to be useful, so what we would really like, is the shape of the distribution… to be continued.

Written by S

Sun, 2016-01-03 at 03:06:23

Posted in mathematics