The 2011 NCAA brackets via random walkers
It's tournament time so, as always, I'm looking for better ways to fill out my bracket. This year, I've decided to automate the process with Mathematica. I should emphasize at the outset that I have no idea how well this will work. Lots of people have put lots of effort into rankings, computerized and otherwise. Frequently, their brackets are fare no better than their girlfriends basing their choices on mascot names. Thus, I won't get my hopes up. Ultimately, I'm happy with my approach for three reasons:
• It's fun,
• It's mathematically motivated, and
• Ohio State wins!
The random walker algorithm
This specific algorithm I'm using is called the random walker algorithm. The mathematics behind this algorithm is described in detail in a paper by Thomas Callaghan, Peter Mucha, and Mason Porter that appeared in the American Mathematical Monthly in 2007. There is a preprint version on the arXive. The authors' main interest is college football and they maintain a blog discussing their applications in that domain.
The basic idea behind the random walker algorithm is based on a large collection of voters who each cast a single vote for their top team. With each game played by a voter's top team, the voter may choose to switch their vote to the opponent. We assume that if the opponent beats the voter's top team, then the voter is likely to switch to that opponent with probability p∈(1/2,1]. Thus, each voter performs a random walk on the collection of all teams. In the long term, a steady state is reached and we can rank the teams by the proportion of voters preferring a particular team.
As described in the paper, the steady state distribution can be realized as a vector in the null space of the matrix M defined by
![]()
where
is the number of games team i lost,
is the number of games team i won,
is the number of games played between teams i and j, and
is the number of times team i beat team j minus the number of times team j beat time i. (A tie counts 1/2 for both teams, but we need not worry about that thanks to overtime.)
In spite of the mathematical sophistication, this is really quite simple - way too simple in fact. Each walker votes for one team, rather than ranking a list of teams, for example. In spit of this simplicity, the authors make a compelling case for this technique. So let's try it!
Getting and processing the data
The first thing we need is all the raw data representing contest results. Ideally, it would be in a CSV or XML or spreadsheet file that can be easily imported into Mathematica. I've not been able to find such a convenient for, though. Of course, there are many sites, such as espn.com or cnnsi.com that provide scores and other statistics. Mathematica has the ability to import data right off of web pages so this should work. After a bit of searching I went with Ken Pomeroy's basketball ratings page at http://kenpom.com/, largely because it was the first page I found with all teams listed on a single page and all the team names consistent throughout. We can import all the necessary data as follows.
This takes a few minutes so I stored the team names and the data for easy retrieval later.
There's quite a bit of superfluous information in the data, so let's trim it.
It will be convenient to be able to go back and forth from team name to team number in the list.
For example, here's the number of my favorite team.
Here's the data associated with Ohio State.
Ohio State is just one of 345 teams.
The matrix and rankings
The matrix
While we can read it, the trimmedData is actually still a bit too verbose for our purposes. All we really care about for each game of a given team is the team number of the opponent and whether our team won or lost. Thus, we trim the data further to obtain just win/loss data.
We can now set up the random walker matrix M, taking p=3/4.
Consider, for example, Ohio State versus Michigan. These teams played each other three times with Ohio State winning all three contests. We already know that Ohio State is team number 3. Michigan is team number 115:
Thus,
should be
![]()
Let's check
We can perform a more complete check by examining the column sums. As explained in the paper, these should all be zero.
The rankings
OK, now for the fun part. We rank the teams. According to the paper, there should be a unique vector (up to scalar multiple) in the null space of M. Here it is:
This is just a list of 345 numbers in one-to-one correspondence with the 345 teams. We can use them as weights to rank the teams. Here are the top 10 teams corresponding to this ranking.
| {0.167398,Ohio St.} |
| {0.162738,Kansas} |
| {0.149171,San Diego St.} |
| {0.142452,Notre Dame} |
| {0.142349,Duke} |
| {0.14044,Pittsburgh} |
| {0.135169,Brigham Young} |
| {0.127344,Syracuse} |
| {0.125525,Connecticut} |
| {0.12141,North Carolina} |
| {0.119453,Purdue} |
| {0.118917,Louisville} |
| {0.118666,Florida} |
| {0.11651,Texas} |
| {0.115135,Kentucky} |
| {0.111923,Wisconsin} |
Looks pretty good!
Modelling the tournament
At this point, it's quite simple to fill out a bracket by hand by always picking the higher ranked team. Why not automate that procedure as well, though.
Setting up the tournament
The tournament, of course, can be modeled using a binary tree and trees can be easily modeled using nested lists. Here's a natural representation for the east bracket, for example.
We can show that this is a reasonable representation of a tournament by examining the TreeForm.
Here is the rest of the tournament.
Running the tournament
To run the tournament, we simply choose the better ranked (smaller number) team in each game. The playAndSow function compares two teams in round k, advances the winner to round k+1 and sows the two edges for representation of the tournament as a tree. To actually play the tournament, we simply repeatedly apply playAndSow.
Finally, we should be able to display the results by applying TreePlot to edges we've reaped.