Let’s say you have an idea for an app, a website, or some hardware which uses computer vision. The good news is that there is an incredible amount of information to collect from a video stream. The bad news is that despite decades of research and really cool presentations at SIGGRAPH you’re likely going to need to develop and implement your algorithms from scratch. To understand our perspective on the matter, let me tell a brief story.

I remember sitting with the rest of the team watching the keynote live blog when Jobs announced the iPad 2. The bit about the front camera made us all excited.

[Image: iPad 2 keynote slide introducing the front-facing FaceTime camera]

Up until this point we had been using iPads to build interactive experiences for people on cardio machines. As people used their bikes, treadmills, or ellipticals, they created vibrations in the machine which were transmitted to the iPad and its accelerometer. Our app then used those readings to estimate your exercise frequency. It was alright, but not perfect, and it was missing other information we wanted, like exercise phase and body lean.
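As an aside, the core of that accelerometer trick fits in a few lines. Below is a minimal sketch, not our production code: it guesses the dominant vibration period by autocorrelating a window of accelerometer magnitudes. The class name, method name, and 60 Hz sample rate are all hypothetical.

```java
// Hypothetical sketch: estimate exercise cadence from accelerometer vibration.
// Assumes a fixed 60 Hz sample rate; all names here are illustrative only.
public class CadenceEstimator {
    static final double SAMPLE_RATE_HZ = 60.0;

    // magnitudes: sqrt(x*x + y*y + z*z) of each accelerometer reading
    static double estimateFrequencyHz(double[] magnitudes) {
        int n = magnitudes.length;

        // Subtract the mean so gravity's constant offset doesn't dominate.
        double mean = 0;
        for (double m : magnitudes) mean += m;
        mean /= n;

        // Search lags corresponding to plausible exercise periods (0.25-3 s).
        int minLag = (int) (SAMPLE_RATE_HZ * 0.25);
        int maxLag = (int) (SAMPLE_RATE_HZ * 3.0);
        int bestLag = minLag;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int lag = minLag; lag <= maxLag && lag < n; lag++) {
            double score = 0;
            for (int i = 0; i + lag < n; i++) {
                score += (magnitudes[i] - mean) * (magnitudes[i + lag] - mean);
            }
            if (score > bestScore) { bestScore = score; bestLag = lag; }
        }
        // The lag with the strongest self-similarity is the repetition period.
        return SAMPLE_RATE_HZ / bestLag;
    }
}
```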

We suspected that if the device had a front camera, we had a shot at discerning this information via computer vision. Now, at this point none of us had any special training in computer vision, but in hindsight this was to our advantage.

[Image: cv_fan — the computer vision solution space]

This is because the computer vision solution space is wide and shallow*. There are many, many possible computer vision tasks, and the computer-aided solutions for them are generally only moderately sophisticated. What’s more, computer vision research tends to cluster around a few peaks in this solution space, iteratively adding on to research which came before.

It turns out our best solution for exercise tracking on cardio machines was nowhere near the existing solution clusters. Getting there by iterating on existing research would have been far less direct than successive creative guesses. We probably would have started with feature detection and tracking research and customized it to our application constraints. The result would likely have violated existing patents and, worst of all, carried unnecessary complexity and computational overhead.

The solution we ended up with takes advantage of our unique constraints while elegantly sidestepping the domain-specific challenges. It’s efficient, robust to background noise and lighting, capable of detecting very minor exercise motions, and best of all it’s only about a thousand lines of code. It did take quite a while to figure out the right thousand lines of code, and this video shows a bit of how we got there:

[Video: how we arrived at our exercise-tracking solution]

I’ll end on some tips for people looking at implementing computer vision in the real world:

  1. Check the research literature for exact solutions to your problem. It’s worth a shot, if only to understand what doesn’t work.
  2. Even if it’s covered, plan on understanding and implementing it yourself. There is very little chance it’s been written well enough for practical use, for your platform, and under your unique constraints.
  3. If it’s not covered, as it most likely will not be, try successive creative guesses in a rapid prototyping environment like Processing (see the sketch after this list).
  4. As you port promising solutions onto your platform and start measuring performance, optimize the algorithm in C/Java before diving into ASM, GPU, or NEON programming. Those options will really lock you into your solution(s), and one thing we found is that when you’re using successive creative guesses to find increasingly direct solutions, you end up changing your algorithms frequently.
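To make tip 3 concrete, here is the kind of throwaway Processing sketch (Java mode) a first creative guess might look like: it collapses each pair of webcam frames into a single motion number by summing per-pixel brightness changes, a signal you could then feed into something like the autocorrelation above. It assumes the standard processing.video capture library; the rest is illustrative and is not our actual algorithm.

```java
// Throwaway Processing sketch: frame differencing as a crude motion signal.
// Assumes the processing.video library; sizes and details are illustrative.
import processing.video.*;

Capture cam;
PImage prev;

void setup() {
  size(640, 480);
  cam = new Capture(this, width, height);
  cam.start();
}

void draw() {
  if (!cam.available()) return;
  cam.read();
  image(cam, 0, 0);
  cam.loadPixels();
  if (prev != null) {
    prev.loadPixels();
    float motion = 0;
    for (int i = 0; i < cam.pixels.length; i++) {
      // Accumulate brightness change between consecutive frames.
      motion += abs(brightness(cam.pixels[i]) - brightness(prev.pixels[i]));
    }
    println(motion); // plot over time; periodic exercise shows up as a wave
  }
  prev = cam.get();
}
```

Printing the signal and eyeballing it for periodicity is crude, but it tells you within minutes whether a guess deserves a real implementation.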

Feel free to comment on the article submission on Hacker News.

* If you were working on AI, you could view the computer vision problem space as deep and narrow.

  • Terry A Davis

    AI is tin foil hat. Bible is calm, sensible tried and true, not bat-shit pathetically crazy

  • DrBalthar

    There are quite a few examples that prove you wrong, though. Check half of Adobe’s research site: they have published tons of stuff in the CV/graphics community, and lots of that research ended up in their products sooner or later (usually a couple of years behind). The same goes for quite a few other graphics-related application companies. Also, much of the basic CV work was published ten or more years ago and is slowly making its way into mainstream applications.

  • immersivelabs

    Hey guys,

    We created a platform to address the complexity of computer vision in real-world environments. If you’re looking to detect faces, gender, age, and emotions, you can check it out here: http://imrsv.com

    It’s able to detect 25 faces up to 25 ft away using any standard mobile webcam. It currently supports Windows, Linux, and Android.

  • Adrian Rosebrock

    The difference here is that doing research and publishing a paper in academia is substantially different from the entrepreneurial work of launching a product. Developers build the actual product, but the libraries aren’t there for standard developers to implement the current state-of-the-art algorithms, hence implementing them from scratch (as your post discussed). To truly see a product go straight from academia into industry in less than a year or two, you need a person who understands the gap between academia and industry, and who has likely had experience in both: a person who understands the technical aspects, yet can balance them with practicality. I actually blog a lot about this type of stuff over at PyImageSearch.