Let’s say you have an idea for an app, a website, or a piece of hardware that uses computer vision. The good news is that there is an incredible amount of information to collect from a video stream. The bad news is that despite decades of research and really cool presentations at SIGGRAPH, you’re likely going to need to develop and implement your algorithms from scratch. To explain our perspective on the matter, let me tell a brief story.
I remember sitting with the rest of the team watching the keynote live blog when Jobs announced the iPad 2. The bit about the front camera got us all excited.
Up until that point we had been using iPads to build interactive experiences for people on cardio machines. As people used their bikes, treadmills, or ellipticals, they created vibrations that were transmitted through the machine to the iPad and its accelerometer. Our app used those readings to estimate the user’s exercise frequency. It worked reasonably well, but not perfectly, and it missed other information we wanted, like exercise phase and body lean.
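To make the accelerometer approach concrete, here is a minimal, hypothetical sketch of that kind of frequency estimation. The function name, the zero-crossing technique, and all parameters are my own illustration, not the actual shipped code: it removes the DC offset (gravity) from a stream of accelerometer magnitudes, then counts zero crossings to estimate the repetition rate.

```python
import math

def estimate_cadence_hz(samples, sample_rate_hz):
    """Estimate the dominant repetition frequency (e.g. pedal strokes)
    from a 1-D stream of accelerometer magnitude samples."""
    # Remove the DC component (gravity) so crossings track the vibration.
    mean = sum(samples) / len(samples)
    centered = [s - mean for s in samples]
    # Count sign changes; each full cycle produces two zero crossings.
    crossings = sum(
        1 for a, b in zip(centered, centered[1:]) if a < 0 <= b or b < 0 <= a
    )
    duration_s = len(samples) / sample_rate_hz
    return crossings / (2.0 * duration_s)

# Synthetic demo: a 1.5 Hz "pedaling" vibration on top of gravity,
# sampled at 50 Hz for 10 seconds.
rate = 50
samples = [1.0 + 0.2 * math.sin(2 * math.pi * 1.5 * i / rate) for i in range(500)]
print(estimate_cadence_hz(samples, rate))  # prints a value close to 1.5
```

A real implementation would need filtering and more robust peak handling, but this captures the basic idea of inferring cadence from machine vibration.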
We suspected that with a front camera we would have a shot at discerning this information via computer vision. At that point, none of us had any special training in computer vision, but in hindsight this was to our advantage.
This is because the computer vision solution space is wide and shallow*. There are many, many possible computer vision tasks, and the computer-aided solutions for them are generally only moderately sophisticated. What’s more, computer vision research tends to cluster around a few peaks in this solution space, iteratively building on the research that came before.
It turns out our best solution for exercise tracking on cardio machines was not near the existing solution clusters. Getting there by iterating on existing research would have been far less direct than successive creative guesses. We probably would have started with feature detection and tracking research and customized it to our application constraints. It would likely have violated existing patents and, worst of all, it would have carried unnecessary complexity and computational overhead.
The solution we ended up with takes advantage of our unique constraints while elegantly sidestepping the domain-specific challenges. It’s efficient, robust to background noise and lighting, capable of detecting very minor exercise motions, and, best of all, it’s only about a thousand lines of code. It did take quite a while to figure out the right thousand lines of code, and this video shows a bit of how we got there:
I’ll end with some tips for anyone looking to implement computer vision in the real world:
- Check the research literature for exact solutions to your problem. It’s worth a shot, if only to understand what doesn’t work.
- Even if your problem is covered, plan on understanding and implementing the solution yourself. There is very little chance it’s been written well enough for practical use, for your platform, and under your unique constraints.
- If it’s not covered, as it most likely won’t be, try successive creative guesses in a rapid prototyping environment like Processing.
- As you port promising solutions to your platform and start measuring performance, optimize the algorithm in C or Java before diving into assembly, GPU, or NEON programming. Those options really lock you into your solution, and one thing we found was that when you’re using successive creative guesses to find increasingly direct solutions, you end up changing your algorithms frequently.
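As an illustration of the kind of first creative guess you might try in a prototyping environment, here is a toy frame-differencing motion detector, written in Python rather than Processing for brevity. Everything here is hypothetical and illustrative (function name, threshold, frame representation); it simply measures what fraction of pixels changed between two grayscale frames.

```python
def motion_energy(prev_frame, frame, threshold=16):
    """Fraction of pixels whose grayscale value changed by more than
    `threshold` between two frames (each a flat list of 0-255 ints)."""
    changed = sum(
        1 for p, q in zip(prev_frame, frame) if abs(p - q) > threshold
    )
    return changed / len(frame)

# Two tiny 4x4 "frames": the second brightens one corner (4 of 16 pixels).
prev = [10] * 16
curr = [10] * 12 + [200] * 4
print(motion_energy(prev, curr))  # 0.25
```

A sketch this crude is easy to throw away, which is exactly the point of rapid prototyping: cheap guesses you can discard before committing to platform-specific optimization.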
Feel free to comment on the article’s submission on Hacker News.
* If you were working on AI, you could view the computer vision problem space as deep and narrow.
- See our new Kickstarter!