Current Projects

My current projects list is out of control. I have replicated it below, with project statuses.

Internet Topology
It got me a PhD, and now its just sitting there. I haven’t touched it since my talk @Google, but it needs to get (re)written up and submitted to an actual journal. Status: only barely able to look at it without flashbacks. Let it sit for a bit longer.

Calculus Data Mining
I stole an idea from Titus Winters and used data mining techniques on student assessment data to build decision trees to discover “key concepts” that distinguish bad students from good students. This needs to get written up into a small note and sent … somewhere. Next step: Figure out where to send this, so that the writing can be tuned to the audience.

Whole program animation
Work with Sandro Badame (former student). We have built 90% of a system which can animate the execution of entire programs. It looks like we managed to do it in such a way that you get algoviz for free! But the system needs the last 10% which will take 90% of the work. Next step: CODE

Model for programs under memory pressure
Have an idea for how to model program behavior in the presence of large problems and virtual memory. The model is simple, but like all models it needs experimental validation. Experiments are being run, and the results will be submitted to ALENEX. Working on this (occasionally with Greg Rae). Next step: Run current experiments to completion and analyze the data. Build up the MergeSort experiment.

Growing Programs
An introduction to programming being written by me (and possibly others) with the idea in mind that programs are not written, they are not built, they are instead grown. I came up with this independently, but then read that Fred Brooks likes this metaphor as well, which made me feel like I was on the right track. This project has been stalled for quite some time. Next step: Figure out if there is demand for yet another CS1 book, albeit one with a different (and unique) approach.

24 Game
Turn the website into a little note about combinatorial enumeration and the 24 game. Initially for Math Magazine, but their acceptance rate is dismally low, so first try there and then see where else might want it. Next step: more writing.

Projects in stasis/completed:

Tree Animation
Did work with student Sandro Badame, presented it at Boca, and it is now under review. Could use some code cleanup.

Optimizing T-9 keyboards
Found out that optimizing cell phone keyboard usage is NP-complete sometimes and polynomial other times. Submitted to FUN2010, was accepted, was presented, has been published in the proceedings (and right here) Project is on GitHub.

BSD Heapsort is busted

And so are many other sorts, I bet. The libbsd heapsort can’t sort an array that is larger than 2 Gb. This is probably because they have a line somewhere like

 if (2*i+1 > size)

and with a 2Gb region, 2*i will overflow and return a negative number. It is a very strange experience to discover new bugs in 30 year old code.

It’s 2010, and mass storage still sucks

Let me tell you how mass storage should work: I stick in a new disk, power on the machine, and then the new disk is there waiting to be written to, having been identified as new and then joined into a storage cluster in which a group of non-identical disks mutually back each other up. In the event of hardware failure, I would purchase a new disk of equal or greater size to replace the failed one, and then, on bootup, that new disk would transparently take the place of the old one, although with perhaps even more space. The math for doing this is easy, but somehow the will isn’t there.

I have two hard drives in my desktop machine, and have decided to wipe my windows partition and use the two disks to back each other up. This halves my storage space but more than doubles my peace of mind, which is what I am trying to optimize anyway. Unfortunately, to do this I have to dive into totally stupid crap like bizarre mdadm commands and mkfs with bizarre options and no how-to. The fact that I need a how-to is unacceptable, but the fact that one doesn’t even exist is just dumb. And even should I eventually get it all set up, if one of the drives dies, I will not be able to replace it with anything except a drive of the exact same size. If I add another drive, things get harder and more annoying, and if I add even more drives, things became so abstruse and painful that I don’t even want to imagine it.

This sucks. Hey Linux people! I will totally use any mass storage solution which can easily support the addition and replacement of hard drives, and I don’t think I am alone here. I suck at this stuff – you don’t want me writing this code – but I will make it if I have to. I just really don’t want to have to.

OSX! Why isn’t this in there already? Windows! This is your chance to beat OSX in usability! Grab it with both hands!

It’s 2010 and mass storage is pretty much still stuck in the 80s.*

* Journaling filesystems only barely excepted – they can be stuck in the early 90s.

Ulam spiral, take 2

Mostly this is posted for its pleasantly recursive nature. In response to many people asking me how I made the Ulam Spiral, I have produced a a PDF of the spiral which includes the code to make said PDF. I really like how it turned out.


One subtle touch that fellow Python programmers will appreciate is that every column begins with a line of code that is not indented.

Ulam Spiral Algorithm

I looked at many things online which detail how to draw the Ulam Spiral, and all of them use various bad-sounding algorithms. I recently drew the Ulam Spiral and it went very quickly and the algorithm I used was both straightforward and basic-sounding. So I will reproduce it below so that others won’t have to suffer through wrong tutorials in the future. There are two big insights to our algorithm. The first is that we can figure out how many prime numbers we’ll need to generate based on the size of the dots and the size of our canvas, and the second is that, while figuring out a number’s location in the cartesian plane from first principles is tricky, if you just trace the spiral and increment the count as you go along, then you can largely avoid that problem. The Ulam Spiral

The first insight allows us to just use the Sieve of Eratosthenes (and not the over-clever not-really-the-sieve lazy one, but the totally basic eager one). The second insight allows us to just increment the counter as we walk around the squiral. Code is here for the interested, and the PDF it produces is getting printed out on my 24-inch roll-fed printer. I was thinking that I should trace the squiral in light gray to reinforce the pattern, and if that turns out to be really cool, I will reprint it.

Sieve of Eratosthenes

The Sieve of Eratosthenes, implemented lazily in Python

def integers_from(start):
    x = start
    while 1:
        yield x
        x += 1

def filter_out_multiples(stream, f):
    return (i for i in stream if i%f != 0)

numbers = (i for i in integers_from(2))

while numbers:
    p = numbers.next()
    print p

    numbers = filter_out_multiples(numbers, p)

This was done after a student wondered about how to make nicer my Java version of the same. An important thing here is that these implementations are lazy — no generating a huge array of numbers and then crossing them out. Instead, they take each number in turn and run it through an ever-increasing series of filters. I feel virtually certain that this kind of thing is exactly what Haskell was made for, but can’t quite incant it due to my total lack of Haskell experience. Any help? What about in your language of choice? Do Ruby blocks make this even simpler?

This particular problem seems to be of high enough difficulty while being conceptually easy that it might be a good addition to the list of programs one uses to show how to do X in a given language.

The Discoveries

I’ve been reading books on the subway too and from work. It is wonderful. I’ve been alternating between books of quality, and trashy scifi. I just finished my most recent “quality” book, and highly recommend it to all my fellow science nerds and enthusiasts.

In The Discoveries, Alan Lightman has collected the most groundbreaking scientific papers of the 20th century (with a mild bias towards physics) and has simply written an introduction to each one. It is simply fantastic to read all about Heisenberg’s uncertainty principle from the man himself (that would be Herr Heisenberg) or to read Bohr figuring out what atoms are really made of. The most interesting thing to me, aside from Alan Lightman’s superb gloss of the texts, was that many of the papers were incredibly readable. If you have taken calculus, intro physics, and intro chemistry, then you should be able to follow along! Why do physics texts ever do anything besides just reprint the original papers?

The answer, for the discoveries of the last quarter of the 20th century, is that the art of readable scientific writing has been lost. Or, at the very least, clear exposition is not rewarded and is therefore largely neglected in modern research. This makes me very sad. The public has paid for this research, the least we could do is make it as accessible as possible to readers. Some things are simply difficult, and are therefore also difficult to explain, but hiding behind a veil of jargon and false modesty does nobody any favors. If you are a scientist, then you’ve probably already seen some of these, but it is still wonderful to have them all in one place. Buy this book, read it, and have your mind blown once again at the fantastic progress humankind made in the 20th century. From the discovery of the vast distances in the cosmos through the nuclear age to the dawning of biotech, it is all here, and it is all from the people who actually figured it out.

It is a rare book that explains the working of the world, provides the history and context of these discoveries, and yet still manages to reinforce Haldane’s remark that “the Universe is not only queerer than we suppose, but queerer than we can suppose.”

Generate a random true boolean lisp statement

Just in case you want generate a random tautology, you can find python code to do so at: http://pastie.org/933675. It’s actually pretty sweet – it won’t generate all true statements, but it will generate all tautologies of a given depth. Also, it has some clever hacks that allow it to be random (instead of true) if it can get away with it. Consider the case of true OR b. In that case, it is totally okay for b to be false, and the algorithm respects that. The true tautology I generated (and am using for a tree visualization library) is: (and (not (and (not (or (or 1 1) (not 1))) (or (and (not 1) (and 0 0)) (or (or 1 1) (or 1 0))))) (not (and (and (or (not 1) (and 0 1)) (and (or 0 0) (or 1 1))) (or (or (and 1 0) (and 0 1)) (or (and 1 1) (or 0 0)))))), which I am very glad I did not have to come up with by hand.

Fun question for others to answer: Will this alternative generator also generate all possible trees of a given depth?

Biases in Curricula Design

In Mindstorms (the groundbreaking book, not the Lego set) Seymour Papert writes that part of what drives the mathematical curriculum is the technology available to students.

As I see it, a major factor that determined what mathematics went into school math was what could be done in the setting of school classrooms with the primitive technology of pencil and paper. For example, children can draw graphs with pencil and paper. So it was decidet to let children draw many graphs. The same considerations influenced the emphasis on certain kinds of geometry. For example, in school math “analytic geometry” has become synonymous with the representation of curves by equations. As a result every educated person vaguely remembers that y=x^2 is the equation for a parabola. And although most parents have very little idea why anyone should know this, they become indignant when their children do not. They assume that there must be a profound and objective reason known to those who better understand these things. [...] Very few people ever suspect that the reason for what is included and what is not included in school math might be as crudely technological as the ease of production of parabolas with pencils!

Somewhat dangerously, I disagree with Seymour Papert just a little bit. Or maybe I am a generation younger and the school system changed in the meantime. Either way, I think that the biases of our current educational system are less due to available technology, and more towards that which can be quantitatively tested. That which we can test, we now test extensively, and that which cannot be tested in a standardized way, we pass over. Unfortunately, this is exactly wrong if we want to help mold students which are better able to function in the future.

That which can be quantitatively tested is usually better done by a computer! They add faster and more accurately, they factor polynomials more quickly, draw functions more quickly, etc , etc, etc. It is only those things which we cannot test quantitatively where humans have any advantage over machines. Computers are pretty bad at coming up with useful models of a real-world situation, and they really suck at being creative (all by themselves, that is — human/computer teams can be fantastically creative and productive and superb modelers). The biases of our educational system are forcing students to become more computer-like, but these are actually the least-relevant skills for the future! Students, responding to this obvious disconnect between the skills and knowledge they need and the skills and knowledge they are taught then have a few paths: they can drop out; they can cynically treat school as a game; they can ignore school and teach themselves on the weekends; or they can cling to the belief that somehow this will all pay off in the end because it would just be insane if the goals of school and the needs of adult life were actually as divergent as it seems.

I did the last. But as near as I can tell, our current focus on quantitative testing all but ensures that the next generation of students will be worse-prepared to face the world than the current generation. By only teaching that which we can test, we optimize our educational system to teach the skills most likely to quickly become obsolete!

Ada Lovelace Day

On Ada Lovelace Day, one is encouraged to write about a great woman involved in computer science. That’s today, and I’ll take Barbara Liskov! She figured out what she, in her book “Program Development in Java”, modestly calls “the substitution principle”, but everyone else in the world calls “the Liskov substitution principle”.

It’s an insight that, like all good insights, ends up seeming obvious in retrospect. In particular, it states that if you have some new code which you would like to use to replace some old code, then, if you want to have a safe program and not touch any other parts, the following conditions must hold:

  • The new code isn’t more particular about it’s input than the old code
  • The new code doesn’t output things which would be invalid coming from the old code
  • The new code preserves all of the promises of the old code
  • If you couldn’t change a part of the old code, then you can’t change it in the new code either.

These rules are a little intimidating — but they start making sense when you think about them. Imagine, for instance, substituting one piece of A/V equipment for another. You couldn’t do a straight substitution if the new one couldn’t accept signals that the parent was okay with, and this corresponds to condition 1. You couldn’t do a straight substitution if the output of the new piece of equipment was in a format that the parent couldn’t output, and this corresponds to condition 2. The last two conditions are more peculiar to programming, but have a similar level of intuitiveness.

She figured out exactly what it means for software to be “swappable”, and for that and other reasons she was awarded the Turing award. But in particular, her research is of a kind that I think we need more of in CS: the formalization of observational research on how people program and what trips them up. As a science, we are still at the Linnaeus “naming things” stage. We need larger and better and more-full taxonomies of problems and solutions. For figuring out how a piece of the programming taxonomy could work, Liskov gets my vote for Ada Lovelace day…