Computer Vision

As you probably know, TripleLift's value prop includes computer vision. It has two pieces - decomposition and rendering: analyzing an image to figure out what's in it, and then using that analysis to come up with the best way to render the image. In this Lift Letter, we discuss decomposition.

Decomposition is the traditional use case for computer vision - training computers to figure out what's going on in an image, and where. Think about your eyes and your brain - how do you have any idea that you're looking at a chair, especially if you've never seen this particular chair before, or you're seeing a familiar chair from a new angle or in new lighting? Very few computer vision systems rely on remembering every possible picture of a thing - that would be horribly inefficient and wouldn't work very well. Instead, it's about training a computer (or a living being) to understand what defines a chair generally, and how it can be discerned.

If you look around, you can consider what you see, in effect, as textures on objects. The objects themselves have boundaries that can be thought of as transitions in shading - this is clearer if you close one eye. When computers "look" at an image, they see pixel values - shades of red, green, and blue, or simply light and dark - just numbers. In one type of computer vision analysis (deep learning with convolutional neural nets), the first step is creating a set of feature detectors. As discussed above, these represent an effort to detect the transitions between shades that might define the boundaries or significant features of an object. They're often implemented as a matrix of, effectively, light and dark values. In the simplest sense, this would be a light pixel next to a dark pixel, or a light pixel above or below a dark pixel - but real detectors are more complex (the selection of these feature detectors is a huge part of what makes a computer vision system effective). You can then apply these filters to an image, through a process known as convolution, to find the areas that match the characteristics defined in the feature detector.

feature.png
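
To make the convolution step concrete, here's a minimal sketch in Python (using numpy). The 3x3 kernel and the tiny made-up image are purely illustrative assumptions for this example - they aren't drawn from TripleLift's actual system.

```python
# A minimal sketch of convolution with a hand-picked feature detector.
import numpy as np

# A vertical-edge detector: a light column next to a dark column produces a
# strong response; flat regions produce roughly zero.
kernel = np.array([[ 1, 0, -1],
                   [ 1, 0, -1],
                   [ 1, 0, -1]])

# A toy 6x6 grayscale "image": a bright rectangle on a dark background.
image = np.zeros((6, 6))
image[1:5, 1:4] = 1.0

def convolve(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            patch = image[y:y + kh, x:x + kw]
            feature_map[y, x] = np.sum(patch * kernel)
    return feature_map

print(convolve(image, kernel))  # large-magnitude values mark vertical edges
```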

The output of the convolution is a new grid of values called a feature map. These feature maps are then plugged into a neural net. This is represented abstractly in the image below, where the left-hand side shows the inputs. The middle layers are processing nodes that apply weights and send different signals to the next layer of nodes, and onwards, until you have your output.

neural.png
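
As a rough illustration of that forward pass, here's a small sketch. The layer sizes, the random weights, and the reading of the output as a "chair score" are all assumptions made for the example - a real network would be far larger and its weights would come from training.

```python
# A minimal sketch of the forward pass through a tiny fully-connected net.
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Pretend this 4x4 feature map came out of the convolution step above.
feature_map = rng.random((4, 4))

# Input layer: the flattened feature map (16 values).
x = feature_map.flatten()

# Hidden layer: each node weights every input, sums them, and passes the
# result through a nonlinearity before sending it onward.
w1 = rng.normal(size=(16, 8))
b1 = np.zeros(8)
hidden = relu(x @ w1 + b1)

# Output layer: a single node whose value we read as "how chair-like is
# this image?" on a 0-to-1 scale.
w2 = rng.normal(size=(8, 1))
b2 = np.zeros(1)
chair_score = sigmoid(hidden @ w2 + b2)

print(chair_score)
```

With random weights the "chair score" is meaningless, of course - which is exactly why the training step below matters.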

A big piece of a neural net is the training - you have to "convolve" a lot of chairs and effectively train the neural net to set its weights so that it recognizes the patterns that represent chairs. Deep learning is the practice of using a very large number of intermediate layers, a very large number of nodes, and a huge training set to create a highly trained system. The process will be described another day, but the objective is to create the weights applied to the data conveyed between nodes.
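
To give a flavor of what "setting the weights" means, here's a heavily simplified training loop for a single-layer model using gradient descent. The made-up feature vectors, the label rule, and the learning rate are all invented for illustration - real systems train millions of weights over enormous datasets.

```python
# A minimal sketch of training: nudge the weights so the model's outputs
# move toward the right answers.
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy training set: 4-value "feature vectors" labeled chair (1) or not (0).
# The labeling rule below is entirely made up for this example.
features = rng.random((100, 4))
labels = (features[:, 0] + features[:, 1] > 1.0).astype(float)

weights = np.zeros(4)
bias = 0.0
learning_rate = 0.1

for step in range(1000):
    predictions = sigmoid(features @ weights + bias)
    error = predictions - labels
    # Gradient of the cross-entropy loss with respect to weights and bias.
    grad_w = features.T @ error / len(labels)
    grad_b = error.mean()
    # Step the weights a little in the direction that reduces the error.
    weights -= learning_rate * grad_w
    bias -= learning_rate * grad_b

accuracy = ((sigmoid(features @ weights + bias) > 0.5) == labels).mean()
print(f"training accuracy: {accuracy:.2f}")
```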

So, in the first step we get feature maps that represent abstract things about an image; the neural net then takes those as input and determines that the image likely contains a chair, based on the relative positions and relative importance of the different elements of the convolutions. This is only a small sample of what goes into computer vision, but it covers some of the important building blocks.