There have been earlier discussions about whether human flaws are superimposed on AI systems built to imitate humans, and whether human error therefore shows up in AI output. Recently, a few wrongly labelled images surfaced on the internet: a baby labelled as a nipple, a pizza labelled as dough, and a swimsuit ridiculously identified as a bra. At first glance this is something to laugh off, but delve deeper and the inherent problem of mislabelling surfaces.
A team of MIT researchers recently discovered that a significant share of the data used to evaluate machine-learning systems is wrongly labelled. After inspecting ten major machine-learning datasets, the researchers estimate that about 3.4% of the labels are erroneous.
The errors range from Amazon and IMDB reviews that are actually negative being labelled as positive, to image tags that misidentify the subject, to similar errors in video datasets.
According to the researchers,
“We identify label errors in the test sets of 10 of the most commonly used computer vision, natural language, and audio datasets, and subsequently study the potential for these errors to affect benchmark results. Errors in the test sets are numerous and widespread: we estimate an average of 3.4% errors across 10 datasets, where for example 2916 label errors comprise 6% of the ImageNet validation set.”
The research paper is titled ‘Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks’.
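The core idea behind hunting for such errors can be sketched with a simple proxy: obtain out-of-sample predictions via cross-validation, then flag examples whose given label the model considers very unlikely. The snippet below is a minimal illustration on synthetic data using scikit-learn; the dataset, model, and 0.1 threshold are my own assumptions for demonstration, not the researchers' actual pipeline.

```python
# Illustrative sketch: flag likely label errors by comparing a model's
# out-of-sample predicted probabilities against the given labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X, y_true = make_classification(n_samples=1000, n_informative=5, random_state=0)

# Corrupt ~3% of the labels to simulate annotator mistakes.
y = y_true.copy()
flip = rng.choice(len(y), size=30, replace=False)
y[flip] = 1 - y[flip]

# Out-of-sample predicted probabilities via 5-fold cross-validation.
probs = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                          cv=5, method="predict_proba")

# Flag examples whose given label receives very low predicted probability.
suspect = np.where(probs[np.arange(len(y)), y] < 0.1)[0]
print(f"flagged {len(suspect)} suspect labels; "
      f"{np.isin(suspect, flip).sum()} of them are genuinely corrupted")
```

In practice the flagged examples would go to human reviewers for re-labelling rather than being trusted blindly, since a confident model can also be confidently wrong.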
Mislabelling will have far-reaching implications if it is not addressed and solved effectively, and it can even erode trust in AI systems. This is because incorrectly labelled datasets teach the AI the wrong associations, which in turn makes it harder for the system to deliver accurate results. The researchers also suggest preferring lower-capacity models over higher-capacity ones when a dataset contains a high proportion of wrongly labelled data, since higher-capacity models are more prone to fitting the label noise.
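Why a lower-capacity model can fare better on noisily labelled data can be seen in a toy experiment (my own illustration, not the paper's): a 1-nearest-neighbour classifier memorizes every noisy training label, while a 25-neighbour classifier votes most of the noise away, so the simpler model scores higher on a clean test set.

```python
# Illustrative sketch: heavy label noise hurts a high-capacity model
# (1-NN, which memorizes training labels) more than a lower-capacity
# one (25-NN, which averages over neighbours).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Flip 30% of the training labels; the test labels stay clean.
y_noisy = y_tr.copy()
flip = rng.choice(len(y_noisy), size=int(0.3 * len(y_noisy)), replace=False)
y_noisy[flip] = 1 - y_noisy[flip]

# High capacity: 1-NN reproduces the noisy labels verbatim.
acc_complex = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_noisy).score(X_te, y_te)
# Lower capacity: 25-NN's majority vote suppresses the noise.
acc_simple = KNeighborsClassifier(n_neighbors=25).fit(X_tr, y_noisy).score(X_te, y_te)
print(f"1-NN  (high capacity):  {acc_complex:.2f}")
print(f"25-NN (lower capacity): {acc_simple:.2f}")
```

The same effect appears with deep networks, which is why the researchers caution against trusting benchmark rankings computed on noisy test sets.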