The Lumber Room

"Consign them to dust and damp by way of preserving them"

The generating function for Smirnov words (or: time until k consecutive results are the same)

leave a comment »

1. Alphabet

Suppose we have an alphabet {\mathcal{A}} of size {m}. Its generating function (using the variable {z} to mark length) is simply {A(z) = mz}, as {\mathcal{A}} contains {m} elements of length {1} each.

2. Words

Let {\mathcal{W}} denote the class of all words over the alphabet {\mathcal{A}}. There are many ways to find the generating function {W(z)} for {\mathcal{W}}.

2.1.

We have

\displaystyle \mathcal{W} = \{\epsilon\} + \mathcal{A} + \mathcal{A}\mathcal{A} + \mathcal{A}\mathcal{A}\mathcal{A} + \dots

so its generating function is

\displaystyle \begin{aligned} W(z) &= 1 + A(z) + A(z)^2 + A(z)^3 + \dots \\ &= 1 + mz + (mz)^2 + (mz)^3 + \dots \\ &= \frac{1}{1-mz} \end{aligned}

2.2.

To put it differently, in the symbolic framework, we have {\mathcal{W} = \textsc{Seq}(\mathcal{A})}, so the generating function for {\mathcal{W}} is

\displaystyle W(z) = \frac{1}{1 - A(z)} = \frac{1}{1-mz}.

2.3.

We could have arrived at this with direct counting: the number of words of length {n} is {W_n = m^n} as there are {m} choices for each of the {n} letters, so the generating function is

\displaystyle W(z) = \sum_{n \ge 0}W_n z^n = \sum_{n \ge 0} m^n z^n = \frac{1}{1-mz}.

3. Smirnov words

Next, let {\mathcal{S}} denote the class of Smirnov words over the alphabet {\mathcal{A}}, defined as words in which no two consecutive letters are identical. (That is, words {w_1w_2 \dots w_n} in which {w_i \in \mathcal{A}} for all {i}, and {w_i \neq w_{i-1}} for any {1 < i \le n}.) Again, we can find the generating function for {\mathcal{S}} in different ways.

3.1.

For any word in {\mathcal{W}}, by “collapsing” all runs of each letter, we get a Smirnov word. To put it differently, any word in {\mathcal{W}} can be obtained from a Smirnov word {w} by “expanding” each letter {w_i} into a nonempty sequence of that letter. This observation (see Analytic Combinatorics, pp. 204–205) lets us relate the generating functions of {\mathcal{W}} and {\mathcal{S}} as

\displaystyle W(z) = S(\frac{z}{1-z})

which implicitly gives the generating function {S(z)}: we have

\displaystyle S(z) = W(\frac{z}{1+z}) = \frac{1}{1-m\frac{z}{1+z}} = \frac{1+z}{1 - (m-1)z}.

3.2.

Alternatively, consider in an arbitrary word the first occurrence of a pair of repeated letters. Either this doesn’t happen at all (the word is a Smirnov word), or else, if it happens at position {i} so that {w_i = w_{i+1}}, then the part of the word up to position {i} is a nonempty Smirnov word, the letter at position {i+1} is the same as the previous letter, and everything after {i+1} is an arbitrary word. This gives

\displaystyle \mathcal{W} = \mathcal{S} + (\mathcal{S} \setminus \{ \epsilon \}) \cdot \mathcal{Z} \cdot \mathcal{W}

or in terms of generating functions

\displaystyle W(z) = S(z) + (S(z) - 1)zW(z)

giving

\displaystyle S(z) = \frac{W(z) (1 + z)}{1 + zW(z)} = \frac{1 + z}{(1-mz)(1 + \frac{z}{1-mz})} = \frac{1+z}{1 - (m-1)z}

3.3.

A minor variant is to again pick an arbitrary word and consider its first pair of repeated letters, happening (if it does) at positions {i} and {i+1}, but this time consider the prefix up to {i -1}: either it is empty, or the pair of letters is different from the last letter of the prefix, giving us the decomposition

\displaystyle \mathcal{W} = \mathcal{S} + m\mathcal{Z}^2 \cdot \mathcal{W} + (\mathcal{S}\setminus \{ \epsilon \}) \cdot (m-1)\mathcal{Z}^2 \mathcal{W}

and corresponding generating function

\displaystyle W(z) = S(z) + mz^2W(z) + (S(z) - 1)(m-1)z^2W(z)

so

\displaystyle S(z) = \frac{W(z)(1-z^2)}{1 + (m-1)z^2W(z)} = \frac{1-z^2}{1 - mz + (m-1)z^2} = \frac{(1-z)(1+z)}{(1-z)(1 - (m-1)z)}

which is the same as before after we cancel the {(1-z)} factors.

3.4.

We could have arrived at this result with direct counting. For {n \ge 1}, for a Smirnov word of length {n}, we have {m} choices for the first letter, and for each of the other {(n-1)} letters, as they must not be the same as the previous letter, we have {(m-1)} choices. This gives the number of Smirnov words of length {n} as {m (m-1)^{n-1}} for {n \ge 1}, and so the generating function for Smirnov words is

\displaystyle S(z) = 1 + \sum_{n \ge 1} m (m-1)^{n-1} z^n = 1 + mz \sum_{n \ge 1} (m-1)^{n-1}z^{n-1} = 1 + \frac{mz}{1-(m-1)z}

again giving

\displaystyle S(z) = \frac{1 + z}{1 - (m-1)z}

4. Words with bounded runs

We can now generalize. Let {\mathcal{S}_k} denote the class of words in which no letter occurs more than {k} times consecutively. ({\mathcal{S} = \mathcal{S}_1}.) We can find the generating function for {\mathcal{S}_k}.

4.1.

To get a word in {\mathcal{S}} we can take a Smirnov word and replace each letter with a nonempty sequence of up to {k} occurrences of that letter. This gives:

\displaystyle S_k(z) = S(z + z^2 + \dots + z^k) = S(z\frac{1-z^{k}}{1-z})

so

\displaystyle S_k(z) = \frac{1 + z\frac{1-z^{k}}{1-z}}{1 - (m-1)z\frac{1-z^{k}}{1-z}} = \frac{1 - z^{k+1}}{1 - mz + (m-1)z^{k+1}}.

4.2.

Pick any arbitrary word, and consider its first occurrence of a run of {k+1} letters. Either such a run does not exist (which means the word we picked is in {\mathcal{S}_k}), or it occurs right at the beginning ({m} possibilities, one for each letter in the alphabet), or, if it occurs starting at position {i > 1}, then the part of the word up to position {i-1} (the “prefix”) is a nonempty Smirnov word, positions {i} to {i+k} are {k+1} occurrences of any of the {m-1} letters other than the last letter of the prefix, and what follows is an arbitrary word. This gives

\displaystyle \mathcal{W} = \mathcal{S}_k + m\mathcal{Z}^{k+1} \cdot \mathcal{W} + (\mathcal{S}_k \setminus \{ \epsilon \}) \cdot (m-1)\mathcal{Z}^{k+1} \cdot \mathcal{W}

or in terms of generating functions

\displaystyle W(z) = S_k(z) + mz^{k+1}W(z) + (S_k(z) - 1)(m-1)z^{k+1}W(z)

so

\displaystyle W(z)(1 - z^{k+1}) = S_k(z) (1 + (m-1)z^{k+1} W(z))

giving

\displaystyle S_k(z) = \frac{W(z)(1-z^{k+1})}{1 + (m-1)z^{k+1}W(z)} = \frac{1-z^{k+1}}{1-mz + (m-1)z^{k+1}}

4.3.

Arriving at this via direct counting seems hard.

5. Words that stop at a long run

Now consider words in which we “stop” as soon we see {k} consecutive identical letters. Let the class of such words be denoted {\mathcal{U}} (not writing {\mathcal{U}_k} to keep the notation simple). As before, we can find its generating function in multiple ways.

5.1.

We get any word in {\mathcal{U}} by either immediately seeing a run of length {k} and stopping, or by starting with a nonempty prefix in {\mathcal{S}_{k-1}}, and then stopping with a run of {k} identical letters different from the last letter of the prefix. Thus we have

\displaystyle \mathcal{U} = m \mathcal{Z}^k + (\mathcal{S}_{k-1} \setminus \{\epsilon\}) \cdot (m-1)\mathcal{Z}^k

and

\displaystyle U(z) = m z^k + (S_{k-1}(z) - 1) (m-1) z^k

which gives

\displaystyle U(z) = z^k(1 + (m-1)S_{k-1}(z)) = z^k\left(1+(m-1)\frac{1-z^k}{1-mz+(m-1)z^k}\right) = \frac{m(1-z)z^k}{1 - mz + (m-1)z^k}

5.2.

Alternatively, we can decompose any word by looking for its first run of {k} identical letters. Either it doesn’t occur at all (the word we picked is in {\mathcal{S}_{k-1}}, or the part of the word until the end of the run belongs to {\mathcal{U}} and the rest is an arbitrary word, so

\displaystyle \mathcal{W} = \mathcal{S}_{k-1} + \mathcal{U} \cdot \mathcal{W}

and

\displaystyle W(z) = S_{k-1}(z) + U(z) W(z)

so

\displaystyle U(z) = 1 - \frac{S_{k-1}(z)}{W(z)} = 1 - \frac{(1-z^k)(1-mz)}{1-mz + (m-1)z^k} = \frac{m(1-z)z^k}{1 - mz + (m-1)z^k}

6. Probability

Finally we arrive at the motivation: suppose we keep appending a random letter from the alphabet, until we encounter the same letter {k} times consecutively. What can we say about the length {X} of the word thus generated? As all sequences of letters are equally likely, the probability of seeing any string of length {n} is {\frac{1}{m^n}}. So in the above generating function {U(z) = \sum_{n} U_n z^n}, the probability of our word having length {n} is {U_n / m^n}, and the probability generating function {P(z)} is therefore {\sum_{n} U_n z^n / m^n}. This {P(z)} can be got by replacing {z} with {z/m} in the expression for {U(z)}: we have

\displaystyle P(z) = U(z/m) = \frac{(m-z)z^k}{m^k(1-z) + (m-1)z^k}

In principle, this probability generating function tells us everything about the distribution of the length of the word. For example, its expected length is

\displaystyle \mathop{E}[X] = P'(1) = \frac{m^k - 1}{m - 1}

(See this question on Quora for other powerful ways of finding this expected value directly.)

We can also find its variance, as

\displaystyle \mathop{Var}[X] = P''(1) + P'(1) - P'(1)^2 = \frac{m^{2k} - (2k-1)(m-1)m^k - m}{(m-1)^2}

This variance is really too large to be useful, so what we would really like, is the shape of the distribution… to be continued.

Written by S

Sun, 2016-01-03 at 03:06:23

Posted in mathematics

Converting a data URL (aka data URI) to an image on the commandline (Mac OS X)

leave a comment »

This is trivial, but was awfully hard to find via Google Search. Eventually had to give up and actually think about it. :-)

So, a data-URI looks something like the following:

data:image/png;base64,[and a stream of base64 characters here]

The part after the comma is literally the contents of the file (image or whatever), encoded in base64, so all you need to do is run base64 --decode on that part.

For example, with the whole data URL copied to the clipboard, I can do:

pbpaste | sed -e 's#data:image/png;base64,##' | base64 --decode > out.png

to get it into a png file.

Written by S

Sun, 2015-09-27 at 19:58:17

Posted in compknow

Using Stellarium to make an animation / video

leave a comment »

(I don’t have a solution yet.)

I just wanted to show what the sky looks like over the course of a week.

On a Mac with Stellarium installed, I ran the following

/Applications/Stellarium.app/Contents/MacOS/stellarium --startup-script stellarium.ssc

with the following stellarium.ssc:

// -*- mode: javascript -*-
core.clear('natural');  // "atmosphere, landscape, no lines, labels or markers"
core.wait(5);
core.setObserverLocation('Ujjain, India');
core.setDate('1986-08-15T05:30:00', 'utc');
core.wait(5);
for (var i = 0; i < 2 * 24 * 7; i += 1) {
    core.setDate('+30 minutes');
    core.wait(0.5);
    core.screenshot('uj');
    core.wait(0.5);
}
core.wait(10);
core.quitStellarium();

It took a while (some 10–15 minutes) and created those 336 images in ~/Pictures/Stellarium/uj*, occupying a total size of about 550 MB. This seems a start, but Imagemagick etc. seem to choke on creating a GIF from such large data.

Giving up for now; would like to come back in future and figure out something better, that results in a smaller GIF.

Written by S

Mon, 2015-09-14 at 20:10:10

Science

with one comment

Written by S

Sun, 2015-08-23 at 15:54:35

Posted in Uncategorized

The state of static analysis on Python

leave a comment »

Here is a program that fails with UnboundLocalError three out of four times:

"""Module to demonstrate an error."""

import random

def zero_or_one():
    """Returns either 0 or 1 with equal probability."""
    return random.choice([0, 1])

def main():
    """Function to demonstrate an error."""
    if zero_or_one():
        first = 42
    for _ in range(zero_or_one()):
        second = 42
    print first, second

if __name__ == '__main__':
    main()

(Condensed from a real bug where, because of additional indentation, a variable assignment happened as the last line inside a loop, instead of the first line after it.)

I know of three tools that are popular: pychecker, pyflakes, and pylint. None of them say a single thing about this program. It is questionable whether ever (and if so, how often) code like the above is what the programmer intended.

The first of these, pychecker, is not on pip, and requires you to download a file from Sourceforge and run “python setup.py install”. Anyway, this is the output from the three programs:

/tmp% pychecker test.py
Processing module test (test.py)...

Warnings...

None
/tmp% pyflakes test.py
/tmp% pylint test.py
No config file found, using default configuration


Report
======
12 statements analysed.

Statistics by type
------------------

+---------+-------+-----------+-----------+------------+---------+
|type     |number |old number |difference |%documented |%badname |
+=========+=======+===========+===========+============+=========+
|module   |1      |1          |=          |100.00      |0.00     |
+---------+-------+-----------+-----------+------------+---------+
|class    |0      |0          |=          |0           |0        |
+---------+-------+-----------+-----------+------------+---------+
|method   |0      |0          |=          |0           |0        |
+---------+-------+-----------+-----------+------------+---------+
|function |2      |2          |=          |100.00      |0.00     |
+---------+-------+-----------+-----------+------------+---------+



Raw metrics
-----------

+----------+-------+------+---------+-----------+
|type      |number |%     |previous |difference |
+==========+=======+======+=========+===========+
|code      |11     |73.33 |11       |=          |
+----------+-------+------+---------+-----------+
|docstring |3      |20.00 |3        |=          |
+----------+-------+------+---------+-----------+
|comment   |0      |0.00  |0        |=          |
+----------+-------+------+---------+-----------+
|empty     |1      |6.67  |1        |=          |
+----------+-------+------+---------+-----------+



Duplication
-----------

+-------------------------+------+---------+-----------+
|                         |now   |previous |difference |
+=========================+======+=========+===========+
|nb duplicated lines      |0     |0        |=          |
+-------------------------+------+---------+-----------+
|percent duplicated lines |0.000 |0.000    |=          |
+-------------------------+------+---------+-----------+



Messages by category
--------------------

+-----------+-------+---------+-----------+
|type       |number |previous |difference |
+===========+=======+=========+===========+
|convention |0      |0        |=          |
+-----------+-------+---------+-----------+
|refactor   |0      |0        |=          |
+-----------+-------+---------+-----------+
|warning    |0      |0        |=          |
+-----------+-------+---------+-----------+
|error      |0      |0        |=          |
+-----------+-------+---------+-----------+



Global evaluation
-----------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)

/tmp% 

Written by S

Tue, 2015-06-23 at 14:19:34

Posted in programming

Tagged with

Bhavabhuti on finding a reader

with 3 comments

Bhavabhūti, the 8th-century author of the very moving play Uttara-rāma-carita, has in one of his other works these lines, any author’s consolation that even if your work receives not enough praise today, someday the right sort of reader will come along, who will derive great joy or meaning from it.


ये नाम केचिदिह नः प्रथयन्त्यवज्ञां 
जानन्ति ते किमपि तान्प्रति नैष यत्नः ।
उत्पत्स्यते तु मम कोऽपि समानधर्मा 
कालो ह्ययं निरवधिर्विपुला च पृथ्वी ॥

ye nāma kecit iha naḥ prathayanti avajñām
jānanti te kim api tān prati na eṣa yatnaḥ |
utpatsyate tu mama ko api samāna-dharmā
kālo hi ayaṃ niravadhiḥ vipulā ca pṛthvī ||

Those who deride or ignore my work —
let them know: my efforts are not for them.
There will come along someone who shares my spirit:
the world is vast, and time endless.

This verse has become a favourite of many. It appears already in the first known anthology of Sanskrit verses (subhāṣita-collection), Vidyākara’s Subhāṣita-ratna-koṣa. (It’s numbered 1731 (= 50.34) in the edition by Kosambi and Gokhale, and translated by Ingalls.) Ingalls writes and translates (1965):

Of special interest are the verses of Dharmakīrti and Bhavabhūti, two of India’s most original writers, which speak of the scorn and lack of understanding which the writings of those authors found among contemporaries. To such disappointment Dharmakīrti replies with bitterness (1726, 1729), Bhavabhūti with the unreasoning hope of a romantic (1731). If the souls of men could enjoy their posthumous fame the one would now see his works admired even far beyond India, the other would see his romantic hope fulfilled.

Those who scorn me in this world
have doubtless special wisdom,
so my writings are not made for them;
but are rather with the thought that some day will be born,
since time is endless and the world is wide,
one whose nature is the same as mine.

A translation of this verse is also included in A. N. D. Haksar’s A Treasury of Sanskrit Poetry in English Translation (2002):

The Proud Poet

Are there any around who mock my verses?
They ought to know I don’t write for them.
Someone somewhere sometime will understand.
Time has no end. The world is big.
— translated by V. N. Misra, L. Nathan and S. Vatsyayan [The Indian Poetic Tradition, 1993]

Andrew Schelling has written of it in Manoa, Volume 25, Issue 2, 2013:

Critics scoff
at my work
and declare their contempt—
no doubt they’ve got
their own little wisdom.
I write nothing for them.
But because time is
endless and our planet
vast, I write these
poems for a person
who will one day be born
with my sort of heart.

“Criticism is for poets as ornithology is for the birds,” wrote John Cage. Bhavabhūti has scant doubt that future generations will honor his work. The reader who will arise, utpalsyate [sic], is somebody of the same faith, heart, or discipline, samānadharmā.

Just now also found it on the internet, here (2014) (misattributed to Bhartṛhari):

There are those who
treat my work with
studied indifference.
Maybe they know something,
but I’m not writing for them.
Someone will come around
who feels the way I do.
Time, after all, is limitless,
and fortune spreads far.

Finally, in Sadāsvada, written by my friend Mohan with some comments from me, this was included in one of our our very first posts (2012):

In his play Mālatīmādhava, he makes a point that deserves to be the leading light of anyone wishing to do something of value and is put off by discouragement. Standing beside the words attributed to Gandhi (“First they ignore you, then they laugh at you, then they fight you, then you win.”) and Teddy Roosevelt (“It is not the critic who counts…”), Bhavabhūti’s confidence in the future stands resplendent:

“They who disparage my work should know that it’s not for them that I did it. One day, there will arise someone who will truly know me: this world is vast, and time infinite.”

The quote by Roosevelt is:

It is not the critic who counts; not the man who points out how the strong man stumbles, or where the doer of deeds could have done them better. The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood; who strives valiantly; who errs, who comes short again and again, because there is no effort without error and shortcoming; but who does actually strive to do the deeds; who knows great enthusiasms, the great devotions; who spends himself in a worthy cause; who at the best knows in the end the triumph of high achievement, and who at the worst, if he fails, at least fails while daring greatly, so that his place shall never be with those cold and timid souls who neither know victory nor defeat.

[Note on the text: in Vidyākara’s compilation the verse ends with “विपुला च लक्ष्मीः” (vipulā ca lakṣmīḥ) instead of “विपुला च पृथ्वी” (vipulā ca pṛthvī), but the actual source work Mālatī-mādhava has the latter, as do all quotations of this verse elsewhere (e.g. the काव्यप्रकाशः of Mammaṭa, the Sahityadarpana of Viśvanātha Kavirāja, the रसार्णवसुधाकरः of श्रीसिंहभूपाल), and that is what Ingalls uses: “For lakṣmīḥ, which utterly destroys the line, read pṛthvi with the printed texts of Māl.” Actually, most quotations have “utpatsyate ‘sti” in place of “utpatsyate tu”: “either will be born, or already exists…”.]

Written by S

Tue, 2015-06-16 at 11:16:58

Printing floating-point numbers

leave a comment »

Everyone knows that floating-point numbers, being discrete, cannot possibly exactly represent every real number. In particular, the usual binary (IEEE 754) floating point numbers cannot even exactly store all numbers exactly representable in decimal (e.g. 0.3 or 0.1, which are not dyadic rationals).

But what about printing them?

Just because the number stored internally is not 0.1 but the closest approximation to it (say as 0.100000001490116119384765625) doesn’t mean it should be printed as a long string, when “0.1” means exactly the same number.

This is a solved problem since 1990.

TODO: Write rest of this post.

Bryan O’Sullivan (of Real World Haskell fame):
http://www.serpentine.com/blog/2011/06/29/here-be-dragons-advances-in-problems-you-didnt-even-know-you-had/

Steel & White paper How to Print Floating-Point Numbers Accurately: https://lists.nongnu.org/archive/html/gcl-devel/2012-10/pdfkieTlklRzN.pdf

Their retrospective: http://grouper.ieee.org/groups/754/email/pdfq3pavhBfih.pdf

Burger & Dybvig:
http://www.cs.indiana.edu/~dyb/pubs/FP-Printing-PLDI96.pdf

Florian Loitsch:
http://www.cs.tufts.edu/~nr/cs257/archive/florian-loitsch/printf.pdf

Russ Cox: http://research.swtch.com/ftoa

http://www.ryanjuckett.com/programming/printing-floating-point-numbers/
https://labix.org/python-nicefloat

Edit (Thanks Soma!): Printing Floating-Point Numbers: A Faster, Always Correct Method from POPL’16.

Written by S

Wed, 2015-04-01 at 16:50:47

Posted in Uncategorized

Follow

Get every new post delivered to your Inbox.

Join 134 other followers