When machine learning encounters computer vision

2022-12-08

Computer vision is the field of developing computer algorithms for automatically understanding image content, which evolved from artificial intelligence and cognitive neuroscience around the 1960s. In order to "solve" visual problems, Massachusetts Institute of Technology (MIT) launched a summer project in 1966. The project received widespread attention at the time, but people found that their understanding of the field was still very limited, and there was still a long way to go before this project matured. Today, 50 years later, the issue of a general understanding of images has not been perfectly resolved, but the field is still thriving. With the successful commercialization of projects such as interactive segmentation, which can be used to "remove background" in Microsoft Office, image search, face detection, focusing, and Kinect for capturing human movements, the field of computer vision has made tremendous progress, and visual algorithms have also begun to receive widespread attention. The main source of this wave of computer vision development is the rapid rise of machine learning (ML) in the past 15 to 20 years.

The first article in this series will explore some of the challenges faced by computer vision and will involve the powerful artificial intelligence technology of decision forests for pixel classification.

image classification

Please try to answer the following image classification question: "Is there a car in this picture?". For computers, an image is simply a grid containing red, green, and blue pixels, each represented by a number from 0 to 255. These numbers depend not only on whether the object appears, but also on some interfering factors such as the camera's perspective, lighting conditions, background, and target pose. In addition, we also need to deal with the changes in the appearance of the car caused by different types. For example, a car may be a station wagon, a small truck, or a small sedan, each of which corresponds to a very different pixel grid

Fortunately, supervised machine learning provides an alternative solution to manually writing code to solve these problems that involve countless possibilities. By collecting images to generate a training set and appropriately labeling each image in the training set, we can use appropriate machine learning algorithms to identify which pixel patterns are relevant to our recognition work and which are interfering factors. When learning the invariance of interfering factors, we hope to generalize the image into new test examples of the objects we are concerned about through learning. At present, we have made certain progress in visual learning algorithms, dataset collection, and labeling.

Decision Forest for Pixel Classification

Images contain details at many levels. As mentioned earlier, we can ask a question similar to whether the entire image contains a specific category of objects (such as cars). But we can also try to solve a seemingly more difficult problem, the "semantic image segmentation" problem: depicting all objects in the scene. Here is an example of street scene segmentation:

It can be imagined that semantic image segmentation technology will help us selectively edit photos, and even synthesize new images. We will now see some more applications.

There are many methods to solve semantic segmentation problems, but pixel classification is a key foundation: training a classifier at each pixel point to predict the distribution of object categories (such as cars, roads, trees, walls, etc.). This work poses some computational challenges for machine learning. Especially when the image contains a large number of pixels (such as the Nokia 1020 smartphone being able to capture an image of 40 million pixels). This means we potentially have millions of times more training and testing cases than using the entire image for classification.

The scale of this problem leads us to study a specific and effective classification model, decision forest (also known as random forest, random decision forest). A decision forest is a collection of multiple trained decision trees:

Each tree has a root node, multiple internal "split" nodes, and multiple terminal "leaf" nodes. Test classification starts from the root node and calculates some binary "split functions" of the data. These "split functions" are simple questions such as "Is the R value of this pixel higher than that of its neighboring points?". Depending on the outcome of the binary decision, the tree will branch left or right and continue to query the next splitting function, repeating the loop. When it finally reaches a leaf node, a stored prediction will be output, which is usually a histogram about the class labels. (You can refer to Chris Burges' recent excellent paper on the ascending mutation of search ranking decision trees)

The advantage of decision trees lies in their high efficiency in testing time: although there are exponentially many possible paths from the root node to the leaf node, any single test pixel can only follow one path. In addition, the calculation of the split function is based on previous results: for example, the classifier will ask a more appropriate question based on the results of previous questions. This is similar to the strategy used in the game "Twenty Questions": when you are only allowed to ask a small number of questions, you can quickly get a correct result by constantly adjusting the questions based on previous answers.

By using this technology, we have achieved remarkable results in handling various issues such as semantic segmentation of photos, segmentation of street scenes, segmentation of human anatomy in 3D medical scanning, repositioning of cameras, and segmentation of human parts in Kinect depth images. For Kinect, the efficiency of decision tree testing time is crucial: our computing budget is quite tight, but conditional computing combined with pixel parallel processing technology on Xbox GPU means we can solve problems within our budget.

Original author:ML Blog Team

http://article.yeeyan.org/view/406438/443422/

Home

About Us

About Us

Products

Products

News

News

Customer case

Contact Us

Contact Us

When machine learning encounters computer vision

Quick Link

Product category

Contact Us