Computer Vision and Computer Vision applied

Prakash Kagitha
Akaike Technologies

--

Computer Vision is a scientific endeavor that aims to automate the human visual system: not necessarily to imitate it, but to emulate all of its abilities and go beyond them. As with any field that has the potential to change the course of humanity, computer vision has a rich history of obsessive human effort, hard problems, victories, failures, and immense hope.

In spite of the complexity of the visual scenes in the world around us, computer vision is now capable of detecting practically any type of object, including people, vehicles, and so on. Not only can it detect objects, it can recognize their characteristics: identity and gestures in the case of humans, deformations and defects in the case of manufactured objects, intrusions across boundaries, and much more. And this is just a small part of the ocean that is applied computer vision.

Computer vision is the visual system that lets computers play highly strategic games like Go and StarCraft better than humans [1]. It can pave the way for self-driving cars to take actions automatically so that they won't run into other cars or people. It seems that computer vision can do anything, as long as we have data and computation. Models that learn to classify discriminatively and models that learn the whole distribution that generated the dataset have both shown significant progress in solving many real-world problems.

As a matter of fact, we at Akaike Technologies believe that computer vision can solve a great many problems, as long as we can formulate them. As a deep learning services company, we have deployed applications built on object detection, image segmentation, and machine inspection, among others. At the same time, working at the frontiers of computer vision, we know its current limitations and are ambitious to push the field further towards the collective vision of the community.

Now we wish to delve into how computer vision evolved to be so successful: how what started as an application in itself became a great scientific endeavor, and what its limitations and prospects are, with an eye towards its application to real-world problems around us.

Initial aspirations for Computer Vision (1950s-2012)

Computer vision emerged not much later than the field that aimed to make intelligent machines, Artificial Intelligence. In the 1950s and 1960s, there were small computer vision projects in the labs of AI pioneers like Marvin Minsky and Seymour Papert, among others. We can see the optimism of the era in 1966, when Seymour Papert wrote a proposal for imparting visual intelligence to machines, describing it as a research project for a single summer [2]. Of course, it turned out there was a lot more to it.

Great work was done in David Marr's lab, which approached the problem from a neuroscience perspective [3]. Prominent figures like Marvin Minsky and Patrick Winston, among others, also worked on visual systems for robots, betting on the idea that even visual intelligence could be symbolized and then reasoned about, a top-down approach to visual intelligence [4]. Despite its promise, this approach delivered little considerable development in computer vision relative to what we are able to do now. Instead, the bottom-up approach, loosely modeled on the brain's processing units, neurons, and composed with end-to-end learning mechanisms, came to dominate the field, as we discuss in the next section.

From the 1980s to as late as 2010, the prevailing way to do computer vision, whether object detection, scene understanding, or classifying images under different schemas, was to devise mechanisms for extracting a broad range of relevant features using expert domain knowledge, and then to apply traditional statistics or machine learning on top of those features [5]. These models, even though they worked in some cases, needed a lot of domain-specific knowledge and manual engineering, and were brittle outside their very specific problem.

The success of CNNs for Computer Vision

In 2012, the paradigm shift famously happened when Convolutional Neural Networks (CNNs), a variant of neural networks with characteristics that make them perform very well on images, reduced the error rate on a widely studied object recognition dataset called ImageNet by 50% [6]. From around that point, all or most of computer vision turned to convolutional neural networks. This was more dramatic than we could imagine. Yann LeCun, the inventor of convolutional neural networks, was told, even though his approach (CNNs) achieved state of the art, that it wasn't worth a publication in CVPR (a very respectable computer vision conference) because it didn't tell us anything about the visual system. That was in 2010. Now it's hard to find a CVPR paper that doesn't contain convolutional neural networks.

This very successful approach didn't emerge overnight, even if its rise felt sudden. The notion of creating intelligent machines by connecting brain-like units, neurons, into groups is very old; it used to be called connectionism [7]. The notion of convolutions, which turned out to be so effective for computer vision, is also not new. Fukushima is often credited with formalizing the idea of convolutions in a network of neurons to make the model shift-invariant, in 1980 [8]. But a significant structure to the idea of convolutional neural networks came from the work of Yann LeCun, credited as the inventor of CNNs, in the late 1980s.

If Prof. LeCun's paper could get rejected even in 2010, imagine the climate in the 1970s and 1980s. Many people devoted their lives to neural networks (and CNNs) without knowing any practical way to train them so that they could perform on par with manually engineered systems or systems based on classical machine learning. Adding to that, the symbolic approach to creating intelligent machines had very prominent supporters, as we discussed in the last section, which slowed the progress of connectionism.

The upper part shows how a kernel is applied to an image, turning it into a feature map of compressed size. The lower part of the image shows a sequence of such operations (hence the name deep) used to classify which digit an image contains. This is a deep convolutional neural network, conceptually very similar to the networks we use today. This work is from 1990.
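The kernel-to-feature-map operation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the paper; the image and the edge-detecting kernel values are made up for the example.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a kernel over an image and return the (valid) feature map."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Each output value is the sum of an elementwise product
            # between the kernel and one patch of the image.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A 3x3 kernel applied to a 6x6 image yields a 4x4 feature map:
# the feature map is smaller than the input, as in the figure.
image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[1.0, 0.0, -1.0]] * 3)  # illustrative vertical-edge kernel
fmap = conv2d(image, kernel)
print(fmap.shape)  # (4, 4)
```

In a CNN the kernel values are not hand-picked like this; they are learned from data, which is exactly the point made in the next sections.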

Through the effort of several researchers over several decades, the efficacy of connectionism (CNNs) became evident on smaller, specific problems. Yann LeCun, in 1990, used convolutional neural networks at AT&T to recognize handwritten digits and automatically read zip codes [9]. Among many people in the community, Yoshua Bengio and Geoffrey Hinton contributed significantly to the learning algorithms behind these networks and to insights about their limitations and techniques to overcome them [10][11], eventually fueling the rise of neural networks as, effectively, an answer for everything.

In 2012, two students from Geoffrey Hinton's lab applied these techniques to an object recognition dataset called ImageNet [6]. This was the dramatic moment that started the deep learning revolution we see around us. The availability of data and the economic feasibility of computation were big factors in making the whole field of deep learning practical and disruptive.

The 2012 paper that caused a 50% decrease in the error rate on the ImageNet dataset. (Left) Some examples from the evaluation set. (Right) Rows of nearest neighbors of last-hidden-layer encodings; the leftmost image is the query and the others are the top N results [6].

Astonishingly, the reason Yann LeCun's CVPR paper got rejected in 2010 is the very reason for the success of computer vision. Specifically, we can credit the success to one fundamental aspect of the deep learning revolution: models learning everything end to end, without any of the hand-engineering that computer vision had required until then. With CNNs, models automatically learn the features, either discriminatively, when classifying, or generatively, when they actually have to generate natural images. We just have to define a task and prepare some data, and the model will learn everything that's needed. This is the formula that brought about the revolution of learning machines and, presumably, a change in the course of humanity.

As shown below, the model learns to detect simple features and then effectively combines them to detect more complex features in each subsequent layer. If, at the end, you add another layer that predicts the probability of the image belonging to each class, you have a model that can classify images. The same paradigm applies even to a very complicated classification, such as whether an image of an object shows a defect or not. These models can also be extended to mark the boundaries of different objects present in an image, to mask objects, and then to track them. Effectively, everything comes down to automatically learning the features specific to the task, end to end.

Below is a visualization of filters that learned to detect specific things in each layer of a deep convolutional neural network:

Layer 1 learned to detect flat surfaces and edges; layer 2 learned to detect some contours [12].
Layer 3 learned to detect a combination of contours, sometimes meaningful parts of objects [12].
Layer 4 and 5 learned to detect parts or the whole object of some class [12].
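The layer-stacking paradigm described above can be sketched end to end in NumPy: stacked convolutions build increasingly complex feature maps, and a final dense softmax layer turns them into class probabilities. All weights here are random placeholders, so the "classifier" is untrained and purely illustrative; in practice every weight is learned from data.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, k):
    """Valid cross-correlation of a 2-D input with a 2-D kernel."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    return np.array([[np.sum(x[i:i+kh, j:j+kw] * k)
                      for j in range(ow)] for i in range(oh)])

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

# Two stacked convolutional layers: simple features, then combinations of them.
image = rng.standard_normal((8, 8))
k1 = rng.standard_normal((3, 3))
k2 = rng.standard_normal((3, 3))
h1 = relu(conv2d(image, k1))   # 6x6 map of simple features
h2 = relu(conv2d(h1, k2))      # 4x4 map of combined features

# A final dense layer maps the flattened features to class probabilities.
W = rng.standard_normal((10, h2.size))  # 10 hypothetical classes
probs = softmax(W @ h2.flatten())
print(probs.sum())  # probabilities over the classes sum to 1
```

Training replaces the random kernels and weights with values that minimize a loss over labeled data, which is what "learning the features end to end" means concretely.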

The computer vision community as a whole bet that this is the way to deal with everything. The formula is to define a task, get some data, and apply neural networks (CNNs); it will get the job done. This is what all the data science and machine learning teams in the world do. If you want to detect cracks or defects on windmills, or segment different types of land from aerial images alone, this is the way. If you aim to deliver personalized health insights or personalized marketing from electronic health data, this is the way. If you have to show personalized ads optimally, this is the way. If you have to detect people, vehicles, and street signs even a kilometer away, this is the way. If you have to detect intrusion into your property or a supermarket, this is the way. If you want to build intelligent systems to assist radiologists, this is the way. In fact, at Akaike Technologies, we have used computer vision on all of these problems and delivered great impact on efficiency and revenue.

Here is how some companies are using computer vision; it gives a sense of how broad the field is. AES uses drones and computer vision to make inspecting energy assets safer and more efficient. LG CNS uses computer vision to accurately detect defects in products on the assembly line. Nordstrom uses computer vision-based product search to let shoppers find products simply by taking a photo. Unilever uses computer vision to gain new insights into consumer behavior and improve ad campaign effectiveness. IDEXX uses computer vision to automatically organize medical imagery and improve the productivity of its radiologists. All of these systems are based on convolutional neural networks. Most businesses could see great change by adopting computer vision.

But don't be deceived: it takes a lot of skill to formulate the problem as one that aligns with the strengths of current computer vision models, to design and train a model specific to the problem, to iterate on it until it is production-ready, and to scale the solution to impact billions of people. That skill is a sound combination of research and implementation, adopting the current literature into production while at the same time contributing to this scientific endeavor. At Akaike Technologies, we have proved to be excellent at solving the hardest of problems; you can see some of our accomplishments at the end.

The prospects of Computer Vision

As successful as computer vision is now, we are nowhere near the ultimate aim of this research endeavor, i.e., to automate the visual intelligence we possess. There is a lot of discussion in this space about how current computer vision techniques could acquire comprehensive scene understanding. Beyond the statements where the term is used more liberally, comprehensive scene understanding is, in some sense, the ultimate fulfillment of computer vision.

To be intentionally and sensibly critical of the models we have today: simply put, they don't have a clue how the world works or how humans behave and feel. A model can detect people in a scene and identify them, but it doesn't know whether someone is a customer or an intruder based on their behavior, something a human would recognize within moments in most cases. The intuition is that if we have data for each thing, we can learn everything. Theoretically this is true, but nobody can collect data on what happens if people do one thing rather than another (to take a single instance: teasing about throwing something at you, as opposed to actually throwing it), and nobody can gather enough data on visual behavior for a model to understand how a person is feeling, or their aims, goals, and motivations.

Sometimes it even feels practical to get the data and make machines learn the things described above, but that is only because we singled out a simple instance of the many great skills humans possess: the ability to understand a concept with minimal supervision, to combine different concepts into more complex ones, to have knowledge of the intuitive physics of the world and the intuitive psychology of humans, the ability to learn, and so on. Aiming to learn these things in the way we model data now feels, as so many people have rightly argued, misguided.

This is evident to the computer vision (deep learning) community at large as well. The community is striving towards models that can learn from data without many human labels, sometimes even unsupervised; models that learn to learn different tasks at the same time; and models that can interactively acquire data and then reason about it to make decisions. To solve computer vision completely, things that feel distinct from it, like decision making, attention, and interactive learning, are nevertheless paramount in building systems that compete with our very efficiently evolved human visual intelligence.

Along with this, many investigations from cognitive science, neuroscience, and philosophy are flowing into computer vision (deep learning) conferences, to equip current models with the abilities that surface when one thinks about computer vision critically [13], [14]. We at Akaike Technologies see imparting these abilities to computer vision as one of the important directions for the next five to ten years, and we actively participate and contribute along with the community.

Computer vision itself emerged as an application of learning systems, but it has now grown into a significant 'scientific' endeavor to model visual intelligence. There is now a distinct sense to computer vision and computer vision applied. We are happy that the field has separated these goals and their approaches, and that they are moving forward complementing each other.

Comprehensive scene understanding, at large, can bring potentially huge value to applied computer vision as well. Next-generation intruder detection, disaster management, and behavior-profile analytics systems are just some of the low-hanging fruit for the community, along with improvements in almost everything we do now.

We believe that to excel at applied computer vision (or applied deep learning) beyond current successes, and to stay relevant over the next five to ten years, one has to exploit current technology with the skill to discern its strengths and weaknesses, and to stay on the horizon of advancements by being critical and asking questions nobody else dares to ask.

Akaike Technologies

We are a rapidly growing group of deep learning experts with two decades of experience across a wide range of domains. We have a proven track record of solving hard problems with computer vision, NLP, and deep learning. We strive to always be at the edge of deep learning, exploiting and exploring.

References

[1] Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. and Dieleman, S., 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), p.484.

[2] Papert, Seymour (1966-07-01). "The Summer Vision Project". MIT AI Memos (1959-2004). hdl:1721.1/6125.

[3] Marr, D., 1982. Vision: A computational investigation into the human representation and processing of visual information.

[4] Minsky, M., 1990. Logical vs. Analogical or Symbolic vs. Connectionist or Neat vs. Scruffy Artificial Intelligence at MIT. Expanding Frontiers, Patrick H. Winston (Ed.).

[5] Jiang, X., 2009, August. Feature extraction for image recognition and computer vision. In 2009 2nd IEEE International Conference on Computer Science and Information Technology (pp. 1–15). IEEE.

[6] Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097–1105).

[7] McClelland, J.L., Rumelhart, D.E., and PDP Research Group, 1986. Parallel distributed processing. Explorations in the Microstructure of Cognition, 2, pp.216–271.

[8] Fukushima, K., 1980. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4), pp.193–202.

[9] Le Cun, Y., Matan, O., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D. and Baird, H.S., 1990, June. Handwritten zip code recognition with multilayer networks. In Proc. 10th International Conference on Pattern Recognition (Vol. 2, pp. 35–40).

[10] Glorot, X. and Bengio, Y., 2010, March. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics (pp. 249–256).

[11] Rumelhart, D.E., Hinton, G.E., and Williams, R.J., 1985. Learning internal representations by error propagation (No. ICS-8506). California Univ San Diego La Jolla Inst for Cognitive Science.

[12] Zeiler, M.D., and Fergus, R., 2014, September. Visualizing and understanding convolutional networks. In European conference on computer vision (pp. 818–833). Springer, Cham.

[13] Hassabis, D., Kumaran, D., Summerfield, C., and Botvinick, M., 2017. Neuroscience-inspired artificial intelligence. Neuron, 95(2), pp.245–258.

[14] Lake, B.M., Ullman, T.D., Tenenbaum, J.B. and Gershman, S.J., 2017. Building machines that learn and think like people. Behavioral and brain sciences, 40.

--

Prakash Kagitha
Akaike Technologies

Storyteller of art and science. Deep learning & Cognitive science.