Meandering Reflections

The 2011 NCAA brackets via random walkers

It's tournament time so, as always, I'm looking for better ways to fill out my bracket.  This year, I've decided to automate the process with Mathematica.  I should emphasize at the outset that I have no idea how well this will work.  Lots of people have put lots of effort into rankings, computerized and otherwise.  Frequently, their brackets are fare no better than their girlfriends basing their choices on mascot names.  Thus, I won't get my hopes up.  Ultimately, I'm happy with my approach for three reasons:
   • It's fun,
   • It's mathematically motivated, and
   • Ohio State wins!

The random walker algorithm

This specific algorithm I'm using is called the random walker algorithm.  The mathematics behind this algorithm is described in detail in a paper by Thomas Callaghan, Peter Mucha, and Mason Porter that appeared in the American Mathematical Monthly in 2007.  There is a preprint version on the arXive.  The authors' main interest is college football and they maintain a blog discussing their applications in that domain.

The basic idea behind the random walker algorithm is based on a large collection of voters who each cast a single vote for their top team.  With each game played by a voter's top team, the voter may choose to switch their vote to the opponent.  We assume that if the opponent beats the voter's top team, then the voter is likely to switch to that opponent with probability p∈(1/2,1].  Thus, each voter performs a random walk on the collection of all teams. In the long term, a steady state is reached and we can rank the teams by the proportion of voters preferring a particular team.

As described in the paper, the steady state distribution can be realized as a vector in the null space of the matrix M defined by

NCAATourneyWithRW_1.gif

where NCAATourneyWithRW_2.gif is the number of games team i lost, NCAATourneyWithRW_3.gif is the number of games team i won, NCAATourneyWithRW_4.gif is the number of games played between teams i and j, and NCAATourneyWithRW_5.gif is the number of times team i beat team j minus the number of times team j beat time i.  (A tie counts 1/2 for both teams, but we need not worry about that thanks to overtime.)

In spite of the mathematical sophistication, this is really quite simple - way too simple in fact.  Each walker votes for one team, rather than ranking a list of teams, for example.  In spit of this simplicity, the authors make a compelling case for this technique.  So let's try it!

Getting and processing the data

The first thing we need is all the raw data representing contest results.  Ideally, it would be in a CSV or XML or spreadsheet file that can be easily imported into Mathematica.  I've not been able to find such a convenient for, though.  Of course, there are many sites, such as espn.com or cnnsi.com that provide scores and other statistics.  Mathematica has the ability to import data right off of web pages so this should work.  After a bit of searching I went with Ken Pomeroy's basketball ratings page at http://kenpom.com/, largely because it was the first page I found with all teams listed on a single page and all the team names consistent throughout.  We can import all the necessary data as follows.

NCAATourneyWithRW_6.gif

This takes a few minutes so I stored the team names and the data for easy retrieval later.

NCAATourneyWithRW_7.gif

NCAATourneyWithRW_8.gif

There's quite a bit of superfluous information in the data, so let's trim it.

NCAATourneyWithRW_9.gif

It will be convenient to be able to go back and forth from team name to team number in the list.

NCAATourneyWithRW_10.gif

For example, here's the number of my favorite team.

NCAATourneyWithRW_11.gif

NCAATourneyWithRW_12.gif

Here's the data associated with Ohio State.

NCAATourneyWithRW_13.gif

NCAATourneyWithRW_14.gif

Ohio State is just one of 345 teams.

NCAATourneyWithRW_15.gif

NCAATourneyWithRW_16.gif

The matrix and rankings

The matrix

While we can read it, the trimmedData is actually still a bit too verbose for our purposes.  All we really care about for each game of a given team is the team number of the opponent and whether our team won or lost.  Thus, we trim the data further to obtain just win/loss data.

NCAATourneyWithRW_17.gif

NCAATourneyWithRW_18.gif

We can now set up the random walker matrix M, taking p=3/4.

NCAATourneyWithRW_19.gif

Consider, for example, Ohio State versus Michigan.  These teams played each other three times with Ohio State winning all three contests.  We already know that Ohio State is team number 3.  Michigan is team number 115:

NCAATourneyWithRW_20.gif

NCAATourneyWithRW_21.gif

Thus, NCAATourneyWithRW_22.gif should be

NCAATourneyWithRW_23.gif

Let's check

NCAATourneyWithRW_24.gif

NCAATourneyWithRW_25.gif

We can perform a more complete check by examining the column sums.  As explained in the paper, these should all be zero.

NCAATourneyWithRW_26.gif

NCAATourneyWithRW_27.gif

The rankings

OK, now for the fun part.  We rank the teams.  According to the paper, there should be a unique vector (up to scalar multiple) in the null space of M.  Here it is:

NCAATourneyWithRW_28.gif

NCAATourneyWithRW_29.gif

This is just a list of 345 numbers in one-to-one correspondence with the 345 teams.  We can use them as weights  to rank the teams. Here are the top 10 teams corresponding to this ranking.

NCAATourneyWithRW_30.gif

{0.167398,Ohio St.}
{0.162738,Kansas}
{0.149171,San Diego St.}
{0.142452,Notre Dame}
{0.142349,Duke}
{0.14044,Pittsburgh}
{0.135169,Brigham Young}
{0.127344,Syracuse}
{0.125525,Connecticut}
{0.12141,North Carolina}
{0.119453,Purdue}
{0.118917,Louisville}
{0.118666,Florida}
{0.11651,Texas}
{0.115135,Kentucky}
{0.111923,Wisconsin}

Looks pretty good!

Modelling the tournament

At this point, it's quite simple to fill out a bracket by hand by always picking the higher ranked team.  Why not automate that procedure as well, though.

Setting up the tournament

The tournament, of course, can be modeled using a binary tree and trees can be easily modeled using nested lists.  Here's a natural representation for the east bracket, for example.

NCAATourneyWithRW_31.gif

We can show that this is a reasonable representation of a tournament by examining the TreeForm.

NCAATourneyWithRW_32.gif

NCAATourneyWithRW_33.gif

Here is the rest of the tournament.

NCAATourneyWithRW_34.gif

Running the tournament

To run the tournament, we simply choose the better ranked (smaller number) team in each game.  The playAndSow function compares two teams in round k, advances the winner to round k+1 and sows the two edges for representation of the tournament as a tree.  To actually play the tournament, we simply repeatedly  apply playAndSow.

NCAATourneyWithRW_35.gif

Finally, we should be able to display the results by applying TreePlot to edges we've reaped.

NCAATourneyWithRW_36.gif

NCAATourneyWithRW_37.gif

Spikey Created with Wolfram Mathematica 8.0