The Lumber Room

"Consign them to dust and damp by way of preserving them"

Archive for January 2016

The generating function for Smirnov words (or: time until k consecutive results are the same)

with one comment

1. Alphabet

Suppose we have an alphabet {\mathcal{A}} of size {m}. Its generating function (using the variable {z} to mark length) is simply {A(z) = mz}, as {\mathcal{A}} contains {m} elements of length {1} each.

2. Words

Let {\mathcal{W}} denote the class of all words over the alphabet {\mathcal{A}}. There are many ways to find the generating function {W(z)} for {\mathcal{W}}.

2.1.

We have

\displaystyle \mathcal{W} = \{\epsilon\} + \mathcal{A} + \mathcal{A}\mathcal{A} + \mathcal{A}\mathcal{A}\mathcal{A} + \dots

so its generating function is

\displaystyle \begin{aligned} W(z) &= 1 + A(z) + A(z)^2 + A(z)^3 + \dots \\ &= 1 + mz + (mz)^2 + (mz)^3 + \dots \\ &= \frac{1}{1-mz} \end{aligned}

2.2.

To put it differently, in the symbolic framework, we have {\mathcal{W} = \textsc{Seq}(\mathcal{A})}, so the generating function for {\mathcal{W}} is

\displaystyle W(z) = \frac{1}{1 - A(z)} = \frac{1}{1-mz}.

2.3.

We could have arrived at this with direct counting: the number of words of length {n} is {W_n = m^n} as there are {m} choices for each of the {n} letters, so the generating function is

\displaystyle W(z) = \sum_{n \ge 0}W_n z^n = \sum_{n \ge 0} m^n z^n = \frac{1}{1-mz}.

3. Smirnov words

Next, let {\mathcal{S}} denote the class of Smirnov words over the alphabet {\mathcal{A}}, defined as words in which no two consecutive letters are identical. (That is, words {w_1w_2 \dots w_n} in which {w_i \in \mathcal{A}} for all {i}, and {w_i \neq w_{i-1}} for any {1 < i \le n}.) Again, we can find the generating function for {\mathcal{S}} in different ways.

3.1.

For any word in {\mathcal{W}}, by “collapsing” all runs of each letter, we get a Smirnov word. To put it differently, any word in {\mathcal{W}} can be obtained from a Smirnov word {w} by “expanding” each letter {w_i} into a nonempty sequence of that letter. This observation (see Analytic Combinatorics, pp. 204–205) lets us relate the generating functions of {\mathcal{W}} and {\mathcal{S}} as

\displaystyle W(z) = S(\frac{z}{1-z})

which implicitly gives the generating function {S(z)}: we have

\displaystyle S(z) = W(\frac{z}{1+z}) = \frac{1}{1-m\frac{z}{1+z}} = \frac{1+z}{1 - (m-1)z}.

3.2.

Alternatively, consider in an arbitrary word the first occurrence of a pair of repeated letters. Either this doesn’t happen at all (the word is a Smirnov word), or else, if it happens at position {i} so that {w_i = w_{i+1}}, then the part of the word up to position {i} is a nonempty Smirnov word, the letter at position {i+1} is the same as the previous letter, and everything after {i+1} is an arbitrary word. This gives

\displaystyle \mathcal{W} = \mathcal{S} + (\mathcal{S} \setminus \{ \epsilon \}) \cdot \mathcal{Z} \cdot \mathcal{W}

or in terms of generating functions

\displaystyle W(z) = S(z) + (S(z) - 1)zW(z)

giving

\displaystyle S(z) = \frac{W(z) (1 + z)}{1 + zW(z)} = \frac{1 + z}{(1-mz)(1 + \frac{z}{1-mz})} = \frac{1+z}{1 - (m-1)z}

3.3.

A minor variant is to again pick an arbitrary word and consider its first pair of repeated letters, happening (if it does) at positions {i} and {i+1}, but this time consider the prefix up to {i -1}: either it is empty, or the pair of letters is different from the last letter of the prefix, giving us the decomposition

\displaystyle \mathcal{W} = \mathcal{S} + m\mathcal{Z}^2 \cdot \mathcal{W} + (\mathcal{S}\setminus \{ \epsilon \}) \cdot (m-1)\mathcal{Z}^2 \mathcal{W}

and corresponding generating function

\displaystyle W(z) = S(z) + mz^2W(z) + (S(z) - 1)(m-1)z^2W(z)

so

\displaystyle S(z) = \frac{W(z)(1-z^2)}{1 + (m-1)z^2W(z)} = \frac{1-z^2}{1 - mz + (m-1)z^2} = \frac{(1-z)(1+z)}{(1-z)(1 - (m-1)z)}

which is the same as before after we cancel the {(1-z)} factors.

3.4.

We could have arrived at this result with direct counting. For {n \ge 1}, for a Smirnov word of length {n}, we have {m} choices for the first letter, and for each of the other {(n-1)} letters, as they must not be the same as the previous letter, we have {(m-1)} choices. This gives the number of Smirnov words of length {n} as {m (m-1)^{n-1}} for {n \ge 1}, and so the generating function for Smirnov words is

\displaystyle S(z) = 1 + \sum_{n \ge 1} m (m-1)^{n-1} z^n = 1 + mz \sum_{n \ge 1} (m-1)^{n-1}z^{n-1} = 1 + \frac{mz}{1-(m-1)z}

again giving

\displaystyle S(z) = \frac{1 + z}{1 - (m-1)z}

4. Words with bounded runs

We can now generalize. Let {\mathcal{S}_k} denote the class of words in which no letter occurs more than {k} times consecutively. ({\mathcal{S} = \mathcal{S}_1}.) We can find the generating function for {\mathcal{S}_k}.

4.1.

To get a word in {\mathcal{S}} we can take a Smirnov word and replace each letter with a nonempty sequence of up to {k} occurrences of that letter. This gives:

\displaystyle S_k(z) = S(z + z^2 + \dots + z^k) = S(z\frac{1-z^{k}}{1-z})

so

\displaystyle S_k(z) = \frac{1 + z\frac{1-z^{k}}{1-z}}{1 - (m-1)z\frac{1-z^{k}}{1-z}} = \frac{1 - z^{k+1}}{1 - mz + (m-1)z^{k+1}}.

4.2.

Pick any arbitrary word, and consider its first occurrence of a run of {k+1} letters. Either such a run does not exist (which means the word we picked is in {\mathcal{S}_k}), or it occurs right at the beginning ({m} possibilities, one for each letter in the alphabet), or, if it occurs starting at position {i > 1}, then the part of the word up to position {i-1} (the “prefix”) is a nonempty Smirnov word, positions {i} to {i+k} are {k+1} occurrences of any of the {m-1} letters other than the last letter of the prefix, and what follows is an arbitrary word. This gives

\displaystyle \mathcal{W} = \mathcal{S}_k + m\mathcal{Z}^{k+1} \cdot \mathcal{W} + (\mathcal{S}_k \setminus \{ \epsilon \}) \cdot (m-1)\mathcal{Z}^{k+1} \cdot \mathcal{W}

or in terms of generating functions

\displaystyle W(z) = S_k(z) + mz^{k+1}W(z) + (S_k(z) - 1)(m-1)z^{k+1}W(z)

so

\displaystyle W(z)(1 - z^{k+1}) = S_k(z) (1 + (m-1)z^{k+1} W(z))

giving

\displaystyle S_k(z) = \frac{W(z)(1-z^{k+1})}{1 + (m-1)z^{k+1}W(z)} = \frac{1-z^{k+1}}{1-mz + (m-1)z^{k+1}}

4.3.

Arriving at this via direct counting seems hard.

5. Words that stop at a long run

Now consider words in which we “stop” as soon we see {k} consecutive identical letters. Let the class of such words be denoted {\mathcal{U}} (not writing {\mathcal{U}_k} to keep the notation simple). As before, we can find its generating function in multiple ways.

5.1.

We get any word in {\mathcal{U}} by either immediately seeing a run of length {k} and stopping, or by starting with a nonempty prefix in {\mathcal{S}_{k-1}}, and then stopping with a run of {k} identical letters different from the last letter of the prefix. Thus we have

\displaystyle \mathcal{U} = m \mathcal{Z}^k + (\mathcal{S}_{k-1} \setminus \{\epsilon\}) \cdot (m-1)\mathcal{Z}^k

and

\displaystyle U(z) = m z^k + (S_{k-1}(z) - 1) (m-1) z^k

which gives

\displaystyle U(z) = z^k(1 + (m-1)S_{k-1}(z)) = z^k\left(1+(m-1)\frac{1-z^k}{1-mz+(m-1)z^k}\right) = \frac{m(1-z)z^k}{1 - mz + (m-1)z^k}

5.2.

Alternatively, we can decompose any word by looking for its first run of {k} identical letters. Either it doesn’t occur at all (the word we picked is in {\mathcal{S}_{k-1}}, or the part of the word until the end of the run belongs to {\mathcal{U}} and the rest is an arbitrary word, so

\displaystyle \mathcal{W} = \mathcal{S}_{k-1} + \mathcal{U} \cdot \mathcal{W}

and

\displaystyle W(z) = S_{k-1}(z) + U(z) W(z)

so

\displaystyle U(z) = 1 - \frac{S_{k-1}(z)}{W(z)} = 1 - \frac{(1-z^k)(1-mz)}{1-mz + (m-1)z^k} = \frac{m(1-z)z^k}{1 - mz + (m-1)z^k}

6. Probability

Finally we arrive at the motivation: suppose we keep appending a random letter from the alphabet, until we encounter the same letter {k} times consecutively. What can we say about the length {X} of the word thus generated? As all sequences of letters are equally likely, the probability of seeing any string of length {n} is {\frac{1}{m^n}}. So in the above generating function {U(z) = \sum_{n} U_n z^n}, the probability of our word having length {n} is {U_n / m^n}, and the probability generating function {P(z)} is therefore {\sum_{n} U_n z^n / m^n}. This {P(z)} can be got by replacing {z} with {z/m} in the expression for {U(z)}: we have

\displaystyle P(z) = U(z/m) = \frac{(m-z)z^k}{m^k(1-z) + (m-1)z^k}

In principle, this probability generating function tells us everything about the distribution of the length of the word. For example, its expected length is

\displaystyle \mathop{E}[X] = P'(1) = \frac{m^k - 1}{m - 1}

(See this question on Quora for other powerful ways of finding this expected value directly.)

We can also find its variance, as

\displaystyle \mathop{Var}[X] = P''(1) + P'(1) - P'(1)^2 = \frac{m^{2k} - (2k-1)(m-1)m^k - m}{(m-1)^2}

This variance is really too large to be useful, so what we would really like, is the shape of the distribution… to be continued.

Advertisement

Written by S

Sun, 2016-01-03 at 03:06:23

Posted in mathematics