In February 2014, four friends and I went to HackBeanpot and made a product called Adventur, which won the prestigious ‘Most Innovative’ prize.
Given a set of dense geographical history data, Adventur does two things:
- Provides a quantitative measure of how adventurous you are.
- Suggests places for you to visit that will make you more adventurous.
The idea came out of a brainstorming session Eric and I had at Tatte Bakery in Kendall Square.
Tatte Bakery is a coffee shop very close to my apartment. I live in Beacon Hill, and Tatte is a ten-minute walk across the Charles River.
I’d been living in Beacon Hill for about five months but had only discovered Tatte a few weeks prior. I wondered what other great places near my daily routes I’d been missing.
Eric was wondering the same, and we decided to find out during HackBeanpot.
At its core, making Adventur meant solving the problem of finding anomalies in a set of geo data.
We reduced this to the problem of determining the probability that the user visits place n after visiting the previous n-1 places.
I also thought it was reasonable to assume that the nth place that the user visited was only ever dependent on the previous n-1 places.
These assumptions scream “Markov model”.
So now, all we needed to do was represent the geo-data in a way that an n-gram model could work with.
Plenty of libraries already implement n-gram models, but that didn’t mean we couldn’t write one ourselves. So we did!
If you gave us a week of your geographic history, we could:
- Accurately predict where you’ll be next
- Figure out your daily routines
- Find places you visited that were statistically unusual
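To make that concrete, here’s a minimal sketch of the idea (not our hackathon code): treat each visited place as a token, train a bigram Markov model over the visit sequence, and score each transition. The place names and the `train_bigrams`/`transition_prob` helpers are invented for illustration.

```python
from collections import defaultdict

def train_bigrams(visits):
    """Count transitions between consecutive places."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, curr in zip(visits, visits[1:]):
        counts[prev][curr] += 1
    return counts

def transition_prob(counts, prev, curr):
    """P(curr | prev); unseen transitions get probability 0."""
    total = sum(counts[prev].values())
    return counts[prev][curr] / total if total else 0.0

# A (heavily simplified) week of visit history.
history = ["home", "office", "cafe", "office", "home",
           "office", "cafe", "office", "home", "museum"]

model = train_bigrams(history)
print(transition_prob(model, "office", "cafe"))   # routine: high probability
print(transition_prob(model, "home", "museum"))   # rare: the "adventurous" signal
```

Transitions with low probability under the model are exactly the “statistically unusual” visits, and the highest-probability transition out of your current place is the prediction of where you’ll go next.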
It took Eric and me the first 24 hours of the hackathon to build.
There were two major problems with this algorithm, though:
- It was slow
- Python has a recursion limit
The algorithm needs dense geo-data to work correctly, but the denser the data, the slower it runs.
We ended up taking a uniform sample of the data, trying to strike a balance between density and performance.
Even when sampling 1000 data points (not that many, when two weeks of data nets you 18,000 points), the algorithm took over five minutes to complete on my machine.
The sampling also reduced the accuracy of the model significantly.
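The sampling step itself was the easy part. Here’s a sketch of one way to do it, keeping evenly spaced points from the trace; the function name and default target are my own, not from our actual code:

```python
def uniform_sample(points, target=1000):
    """Keep roughly `target` evenly spaced points from a dense trace."""
    if len(points) <= target:
        return list(points)
    step = len(points) / target
    return [points[int(i * step)] for i in range(target)]

trace = list(range(18_000))        # stand-in for two weeks of geo-points
print(len(uniform_sample(trace)))  # 1000
```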
Our goal was for users to upload their geo-data and see their results quickly. This wasn’t working.
The first part of our algorithm requires you to do the following:
- Create a bounding box over the geo-data.
- Assign that bounding box an initial label.
- Partition that bounding box into four quadrants.
- Assign a label to each quadrant, prepended with the parent quadrant’s prefix.
- Repeat until each quadrant has a maximum of threshold data points in it.
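The steps above can be sketched as a single recursive function. This is an illustrative reconstruction under my own naming, not our actual hackathon code:

```python
def label_points(points, threshold=4, label=""):
    """Recursively quarter the bounding box until every cell holds at
    most `threshold` points; each point gets its cell's quadrant path."""
    # Stop when the cell is sparse enough, or can't be split further.
    if len(points) <= threshold or len(set(points)) == 1:
        return {p: label for p in points}
    # Bounding box of the current cell, split at its midpoint.
    xs, ys = zip(*points)
    mid_x = (min(xs) + max(xs)) / 2
    mid_y = (min(ys) + max(ys)) / 2
    quadrants = [[], [], [], []]
    for x, y in points:
        q = (2 if y > mid_y else 0) + (1 if x > mid_x else 0)
        quadrants[q].append((x, y))
    labels = {}
    for q, cell in enumerate(quadrants):
        # Child labels extend the parent's prefix, as in step 4 above.
        labels.update(label_points(cell, threshold, label + str(q)))
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(label_points(points, threshold=2))
```

The label of each point encodes its whole path down the quadtree, so nearby points share long label prefixes — which is what makes the labels usable as tokens for the n-gram model.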
Having been taught to think functionally during my first year as an undergraduate, I knew the natural approach to this problem was a recursive one, so that’s what I wrote.
It was pretty, had a tail-call, and worked.
But it was written in Python, and Python has a recursion limit and no tail-call optimization.
So when we tried to run it on samples of more than 1,000 data points, Python would simply abort the computation once it hit the recursion limit.
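A minimal illustration of the failure mode (not Adventur’s code): even a function that is tail-recursive in form still consumes one stack frame per call in CPython, so sufficiently deep inputs blow past the limit.

```python
import sys

def count_down(n):
    # A tail call in form, but CPython still allocates a frame per call.
    if n == 0:
        return "done"
    return count_down(n - 1)

print(sys.getrecursionlimit())   # typically 1000 by default

try:
    count_down(100_000)
except RecursionError:
    print("maximum recursion depth exceeded")
```

You can raise the limit with `sys.setrecursionlimit`, or rewrite the recursion as a loop over an explicit work list, but at the hackathon we took a different route.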
We found this out at midnight on Saturday, about twelve hours before demos.
So, at midnight, after sleeping a whopping total of two hours the night before, we decided to pivot, and create our own algorithm.
Stay tuned for part two - I’ll describe the final algorithm we came up with and what tools we used to get there.