Facebook reimagines self-supervised learning in Computer vision AI with DINO

Source: Fortune

For anyone who thinks that Facebook Inc. is all about social media, with Instagram and WhatsApp, let me break the news to you: it is not! Facebook is one of the top companies in the technology industry and is building the future of technology through its research and innovation.

Mark Zuckerberg’s Facebook Inc. has announced a major breakthrough in the Artificial Intelligence domain: its dedicated research team has reimagined self-supervised learning and semi-supervised learning for Computer Vision Artificial Intelligence.

What is Computer Vision?

Computer Vision is a branch of Computer Science that deals with creating algorithms to teach computer systems about visual representations of real life. It is defined as an interdisciplinary scientific field that primarily focuses on creating advanced digital systems that can replicate the visual and analytical functioning of the human mind. The idea is to teach a computer system to understand, analyse and make sense of visual data, including images and videos, in the same way that human beings do.

Computer Vision is a field of study that deals with enabling computers to gain high-level understanding and processing of the complexities involved with a human’s vision system.

In layman’s terms, Computer Vision is the science and technology that teaches computers and digital systems to identify and classify objects the way human beings do. This technology enables AI systems to understand and process that information and automate further tasks without human intervention.

What is Self-Supervised learning?

Facebook Inc.’s Artificial Intelligence research team recently announced in a blog post that it has made a breakthrough with a self-supervised learning method used to train a “Vision Transformer” model. A Vision Transformer model trained this way can discover and segment the different objects it encounters in images and videos entirely on its own, with no human intervention.

This is a phenomenal breakthrough that will now create numerous opportunities and possibilities in the field of future technology.

According to a scientific journal on AI computing, self-supervised learning, a form of unsupervised learning in Artificial Intelligence, refers to teaching computer systems to perform certain tasks without humans providing any labelled data to the system.

The present technology used in Computer Vision Artificial Intelligence requires human beings to supply labelled data for images and videos, from which the digital system learns to recognise the objects in a visual representation. For instance, if there is a picture of a dog, humans have to attach the label ‘dog’ to the image before the computer vision AI can learn to process that image and conclude that there is a dog in the picture. However, real-life applications of computer vision rarely involve just one object per picture, and this calls for endless tag inputs.

Facebook’s latest breakthrough aims to automate this process. With self-supervised learning, the system can recognise that there is a dog in the picture without anyone adding tags.

Facebook Inc. has named its self-supervised learning technology “DINO”.

What is Facebook’s DINO?

Before diving into the concept of DINO, the research team at Facebook Inc. explains why DINO was created in the first place. Artificial Intelligence is the future of technology, and one main obstacle on its path to wider use is object segmentation. Segmenting multiple objects is one of the most difficult tasks in computer vision because it requires the AI to scan and understand the complexity of everything in an image.

Supervised learning is the traditional way to produce these object segmentations: humans have to add large volumes of tags and data so that the AI can understand the image and break down its complexity. However, this method may not be ideal in many situations.

This is where Facebook’s DINO, which uses self-supervised learning, comes in. According to the Facebook blog post, DINO builds on two major self-supervised approaches, “multi-crop training” and a “momentum teacher”. As mentioned in a report by SiliconANGLE, when these two approaches are combined with DINO’s “self-attention layers”, the result is an advanced model that can carry out object segmentation and build a high-level understanding of every object in the scene on its own, without humans having to attach volumes of tags and annotations to the objects in the image or video.
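To make the two named approaches more concrete, here is a minimal sketch in plain Python. It is not Facebook's implementation. It assumes that a "momentum teacher" means the teacher's weights are a slowly moving average of the student's weights, and that "multi-crop training" means the student is asked to match the teacher's output across different crops of the same image. All function names and the default values (0.996, 0.04, 0.1) are illustrative assumptions, and details of the real method such as centring the teacher outputs and the gradient step on the student are left out.

```python
import math


def softmax(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature gives a sharper result."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits, temperature):
    """Numerically stable logarithm of the softmax probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]


def multi_crop_loss(teacher_logits, student_logits,
                    teacher_temp=0.04, student_temp=0.1):
    """Average cross-entropy between teacher and student over pairs of different crops.

    teacher_logits: one list of scores per global crop (the teacher sees only these).
    student_logits: one list of scores per crop, global crops first, then local crops.
    """
    total, pairs = 0.0, 0
    for t_idx, t_scores in enumerate(teacher_logits):
        target = softmax(t_scores, teacher_temp)
        for s_idx, s_scores in enumerate(student_logits):
            if s_idx == t_idx:
                continue  # never compare a crop with itself
            log_probs = log_softmax(s_scores, student_temp)
            total -= sum(p * lp for p, lp in zip(target, log_probs))
            pairs += 1
    return total / pairs


def update_momentum_teacher(teacher_weights, student_weights, momentum=0.996):
    """Move each teacher weight a small step towards the matching student weight."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_weights, student_weights)]
```

In this sketch no labels appear anywhere: the only training signal is the agreement between the two networks on different views of the same picture, which is the sense in which the method is self-supervised.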

Mike Schroepfer, the Chief Technology Officer at Facebook Inc., has shared a live example of how DINO works in a real-world application.



In the four videos he shared, DINO’s computer vision Artificial Intelligence segments objects cleanly and achieves state-of-the-art results without human intervention. As Schroepfer noted in his tweet, no labelled training data is used, and the object segmentation in these videos is quick and “State of the Art”.

Furthermore, Facebook’s Artificial Intelligence research team highlights that when ImageNet classes are embedded using the features computed with DINO, the system automatically organises similar categories in an interpretable way based on their visual properties. This behaviour resembles how the human visual system treats similar objects.
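One way to picture "similar categories organised together" is to average the feature vectors of each class and check which class averages end up closest to each other. The sketch below does that in plain Python. The two-dimensional feature vectors and the class names are made-up toy values, not real DINO features, and the helper names are hypothetical.

```python
import math
from collections import defaultdict


def class_centroids(features, labels):
    """Average the feature vectors that share a label."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def nearest_class(target, centroids):
    """Return the other class whose centroid lies closest to the target's centroid."""
    others = [label for label in centroids if label != target]
    return min(others, key=lambda label: math.dist(centroids[target], centroids[label]))


# Toy data: "husky" and "wolf" look alike, "teapot" does not.
features = [[0.9, 0.1], [0.8, 0.2], [0.8, 0.3], [0.7, 0.3], [0.1, 0.9], [0.2, 0.8]]
labels = ["husky", "husky", "wolf", "wolf", "teapot", "teapot"]
centroids = class_centroids(features, labels)
print(nearest_class("husky", centroids))
```

With good features, visually related classes such as the two dog-like animals here sit next to each other, while unrelated ones stay far apart.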


SiliconANGLE also reports that DINO performs well at general image classification and excels at identifying copies of the same image. Facebook researchers have confirmed that the system was not trained for this; it learned the capability on its own, which is a promising sign for what this technology could do in the future.
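Copy detection of this kind is commonly done by comparing feature vectors: two images whose vectors point in almost the same direction are treated as copies. The sketch below shows the idea in plain Python. The file names, the three-dimensional vectors and the 0.95 threshold are illustrative assumptions, and the article does not say which comparison Facebook actually uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_copies(query, index, threshold=0.95):
    """Return the names of indexed images whose features nearly match the query."""
    return [name for name, vec in index.items()
            if cosine_similarity(query, vec) >= threshold]


# Toy index: a slightly altered copy stays close to the original, an unrelated image does not.
index = {
    "original.jpg": [0.6, 0.8, 0.0],
    "cropped_copy.jpg": [0.58, 0.81, 0.05],
    "unrelated.jpg": [0.0, 0.1, 0.99],
}
print(find_copies([0.6, 0.8, 0.0], index))
```

The threshold trades missed copies against false alarms, so in practice it would be tuned on examples rather than fixed in advance.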



What is Facebook’s Semi-supervised learning method, PAWS?

PAWS is a new model training approach that is said to build on DINO’s self-supervised format. The word “Semi” marks the significant distinction between PAWS and DINO: PAWS uses a small amount of labelled data, reportedly about one-tenth of what the traditional approach needs, to achieve state-of-the-art results.

PAWS is said to be as accurate as DINO and is built to save much of the time and effort traditionally spent entering volumes of labelled tags and data so that the computer can understand the complexities of an image or video. According to Facebook, when training a ResNet-50 model, PAWS achieves state-of-the-art accuracy using labels for just 1% of the traditional training dataset.
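The article does not explain how PAWS puts its small labelled set to work. As a rough illustration only, the sketch below assumes one plausible mechanism: an unlabelled image receives a soft "pseudo-label" by comparing its features with those of the few labelled examples, with closer examples counting for more. The function name, the toy vectors and the temperature value are all hypothetical.

```python
import math


def soft_pseudo_label(query, support, num_classes, temperature=0.1):
    """Spread one unit of probability over classes according to similarity.

    query: feature vector of an unlabelled image.
    support: list of (feature_vector, class_index) pairs with human labels.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    scores = [cosine(query, vec) / temperature for vec, _ in support]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)

    probs = [0.0] * num_classes
    for weight, (_, label) in zip(weights, support):
        probs[label] += weight / total
    return probs


# Two labelled examples (class 0 and class 1) are enough to label a nearby query.
support = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
print(soft_pseudo_label([0.9, 0.1], support, num_classes=2))
```

The appeal of a scheme like this is that a handful of labels can guide training on a much larger pool of unlabelled images.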

Facebook’s DINO and PAWS: Future of Computer Vision AI systems

Facebook has successfully created DINO as a self-supervised learning method and PAWS as a semi-supervised learning method. Both of these methods will make the future of Computer vision AI far more accessible and accurate than it is now.

Source: SiliconANGLE

As mentioned in a report by SiliconANGLE, Facebook researchers say that the need for human annotation usually serves as a bottleneck in the development of Computer Vision AI systems. With the advancements of DINO and PAWS, the team expects these models to be applied to a larger set of computing tasks and, potentially, to scale up the number of visual concepts they can recognise. Facebook confirms that both DINO and PAWS are available as open-source code on GitHub.