When we talk about artificial intelligence in computer vision, our minds often jump to familiar concepts — face detection, self-driving cars, surveillance cameras. But behind these technologies lies a deceptively complex task: understanding who or what is moving, and where, in real time. This is the domain of object detection and tracking, a space where milliseconds matter and mistakes have consequences.
Kirill Starkov knows this field intimately. With a background in machine learning and a particular focus on computer vision, Kirill has spent years working on production-level systems involving object detection and tracking. His experience spans from video analytics for surveillance to automated licence plate recognition systems and pandemic-response tools built under tight deadlines.
“This field looks clean in theory,” Kirill begins, “but in practice, it’s chaotic. You’re working with noise, low-quality streams, overlapping objects, changing light conditions, and unpredictable behaviour. Every frame is a decision — and every mistake is cumulative.”
Understanding the Technical Stack
At its core, real-time object detection and tracking relies on a combination of well-known models and algorithms. “In most practical pipelines, you’ll see something like YOLO for detection, paired with DeepSORT or a custom tracker for the temporal association,” Kirill explains. “YOLOv4 and YOLOv5 have become very popular in recent years because of their speed-to-accuracy ratio. But the tracker is where the real finesse happens.”
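For readers who want to see the shape of such a pipeline, the sketch below shows a minimal detect-then-track loop. It uses OpenCV for video capture, while detect_objects and tracker are placeholder stand-ins for whichever YOLO model and DeepSORT-style tracker implementation a team adopts; the exact APIs will differ.

```python
import cv2  # OpenCV for frame capture and drawing

# Minimal shape of a detect-then-track loop. `detect_objects` and `tracker`
# are hypothetical placeholders for a YOLO detector and a DeepSORT-style
# tracker; the real call signatures depend on the libraries you choose.

def run_pipeline(source, detect_objects, tracker):
    cap = cv2.VideoCapture(source)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Detection: per-frame boxes, e.g. [(x1, y1, x2, y2, score, class_id), ...]
        detections = detect_objects(frame)
        # Tracking: associate detections with existing tracks to keep stable IDs
        tracks = tracker.update(detections, frame)
        for track_id, box in tracks:
            x1, y1, x2, y2 = map(int, box)
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(frame, str(track_id), (x1, y1 - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
        cv2.imshow("tracking", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()
```

The point of the structure, rather than the specific calls, is the one Kirill makes: detection is stateless and per-frame, while the tracker carries state between frames and owns the identities.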
He describes tracking as the ‘glue’ that holds the detection together. “Detection tells you what’s in the current frame. Tracking tells you what’s been there — and whether it’s still the same object. Without a reliable tracker, you’re just guessing between frames.”
The DeepSORT algorithm, a refinement of the simpler SORT (Simple Online and Realtime Tracking), uses both motion (via Kalman filters) and appearance descriptors to associate detections across frames. Kirill has frequently worked with both. “SORT is faster, and if your objects are distinct and well-separated, it’s enough. But in crowded scenes, or when objects look similar, you need DeepSORT or even your own enhancements on top.”
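To make the association step concrete, here is a deliberately simplified sketch of how motion and appearance cues can be blended into a single assignment problem. It uses IoU against the Kalman-predicted box as the motion term, where DeepSORT proper uses Mahalanobis gating, and cosine distance between appearance embeddings as the appearance term; the weighting and threshold are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU between two boxes in (x1, y1, x2, y2) form."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(pred_boxes, pred_feats, det_boxes, det_feats, lam=0.5, max_cost=0.7):
    """Match Kalman-predicted tracks to new detections.

    Cost blends a motion term (1 - IoU with the predicted box) and an
    appearance term (cosine distance between embeddings). DeepSORT proper
    gates with a Mahalanobis distance from the Kalman filter instead of IoU.
    """
    cost = np.zeros((len(pred_boxes), len(det_boxes)))
    for i, (pb, pf) in enumerate(zip(pred_boxes, pred_feats)):
        for j, (db, df) in enumerate(zip(det_boxes, det_feats)):
            motion = 1.0 - iou(pb, db)
            appearance = 1.0 - np.dot(pf, df) / (
                np.linalg.norm(pf) * np.linalg.norm(df) + 1e-9)
            cost[i, j] = lam * motion + (1.0 - lam) * appearance
    rows, cols = linear_sum_assignment(cost)
    # Keep only matches below the cost threshold; the rest become new tracks
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < max_cost]
```

Dialling lam towards the motion term recovers SORT-like behaviour; leaning on the appearance term is what lets a tracker survive the crowded, similar-looking scenes Kirill describes.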
Real World ≠ Benchmark
Kirill is quick to point out the disparity between academic papers and production deployments. “In benchmarks, you’re running videos from clear, daylight street scenes. In real life, you might be dealing with a night-time car park where half the footage is glare and shadows.”
He recalls one long-term project involving real-time vehicle tracking. “We were building a system to track cars and read their number plates under varying conditions — rain, snow, fog. The detection part was relatively stable, but tracking became unpredictable during certain hours of the day due to lighting changes.”
To solve it, Kirill’s team introduced custom appearance models trained on the specific environment, and even fine-tuned the motion model to better reflect the geometry of the location. “It’s not something you can get from plug-and-play packages. You need to know your data intimately and adapt your tools.”
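He does not share the project’s code, but one way that kind of environment knowledge can enter the matching logic is a simple physical-plausibility gate: reject any association that would imply a speed the site cannot support, using a rough pixels-per-metre calibration of the camera. The numbers below are hypothetical, not values from the project he describes.

```python
import math

# Illustrative site-specific gate. PIXELS_PER_METRE and MAX_SPEED_MPS are
# hypothetical calibration values a team would measure for their own camera
# and scene; they are not taken from the project described above.
PIXELS_PER_METRE = 40.0
MAX_SPEED_MPS = 20.0      # roughly 72 km/h, a ceiling for an access road

def plausible_motion(prev_centre, new_centre, dt_seconds,
                     pixels_per_metre=PIXELS_PER_METRE,
                     max_speed_mps=MAX_SPEED_MPS):
    """Return True if moving between the two box centres within dt_seconds
    implies a speed the site geometry can actually support."""
    dx = new_centre[0] - prev_centre[0]
    dy = new_centre[1] - prev_centre[1]
    metres = math.hypot(dx, dy) / pixels_per_metre
    speed = metres / max(dt_seconds, 1e-6)
    return speed <= max_speed_mps
```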
The project involved a hybrid recognition model for licence plates, which combined transformer layers with an RNN decoder, an unusual architecture at the time. But Kirill highlights that no matter how novel the model, if tracking failed, the whole system lost value. “If your system says this is a new car every five seconds, you can’t build trust or automation on that.”
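As a rough illustration of that architectural idea, not the production model itself, a plate reader in this family might look like the following PyTorch sketch: convolutional features flattened into a horizontal sequence, mixed by transformer encoder layers, then decoded step by step by an RNN. All sizes are placeholders.

```python
import torch
import torch.nn as nn

class PlateRecogniser(nn.Module):
    """Hypothetical sketch of a transformer-plus-RNN plate reader.
    Layer sizes and the 37-character vocabulary (26 letters, 10 digits,
    one blank, e.g. for CTC) are illustrative only."""
    def __init__(self, num_chars=37, d_model=256):
        super().__init__()
        self.backbone = nn.Sequential(              # crude conv feature extractor
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 32)),          # collapse height, keep 32 horizontal steps
        )
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, num_chars)   # per-step character logits

    def forward(self, x):                           # x: (batch, 3, H, W) plate crop
        feats = self.backbone(x)                    # (batch, d_model, 1, 32)
        seq = feats.squeeze(2).permute(0, 2, 1)     # (batch, 32, d_model)
        seq = self.encoder(seq)                     # transformer mixes horizontal context
        out, _ = self.decoder(seq)                  # RNN decodes left to right
        return self.head(out)                       # (batch, 32, num_chars)
```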
Challenges and Edge Cases
“Occlusion is the classic problem,” Kirill says. “You’ve got someone walking behind a pillar. Or two cars crossing paths. Your model needs to predict, interpolate, and recover. But even more difficult are long-term consistency and ID switches.”
In real-time systems, identity switches, in which a tracker mistakenly assigns the wrong ID to an object, are especially harmful. “If a system is meant to monitor the flow of people through a building, or detect abnormal behaviour, an ID switch creates false narratives. It looks like a person duplicated, disappeared, or teleported.”
To reduce these issues, Kirill’s teams often integrate additional context. “If we know the physical layout of a space, we can model likely paths. If we have other sensors, we can cross-reference. Multimodal systems are one of the ways forward.”
He also highlights a common misconception: that increasing detection frequency always improves tracking. “Sometimes, more frames lead to more noise. If your detection isn’t stable, the tracker can’t compensate. It’s about precision, not just volume.”
Applications and Impact
Kirill’s work has had wide-ranging applications, from smart city initiatives to retail security. During the early stages of the COVID-19 pandemic, his team adapted part of their detection-tracking pipeline for a new purpose: face mask detection in public spaces.
“It was a two-day sprint,” he says. “We built a prototype that could detect whether someone was wearing a mask and track them across a space to ensure compliance over time. The system was deployed in several retail chains. It wasn’t perfect, but it served a public health function at a critical time.”
That project later became a cornerstone of a client’s crisis-response suite, and contributed to client retention during the most uncertain period of their business. “It was a reminder that these systems aren’t just technical. They serve people. They inform decisions.”
Advice for Engineers Entering the Field
When asked what advice he would give to engineers starting in real-time computer vision, Kirill emphasises mindset over tools. “Don’t get too attached to the algorithms. Learn the principles. Learn what makes a system robust. Real-time tracking is about trade-offs: between accuracy, speed, and complexity. You won’t get all three.”
He encourages young developers to start with existing open-source pipelines like the YOLO + DeepSORT combo, but not to stop there. “Run your models on dirty data. Film your own video in a train station or a parking lot. Break the system. That’s how you’ll learn what matters.”
Looking to the future, Kirill sees growing relevance in multimodal tracking, federated systems, and self-adapting trackers that learn on the edge. “We’re past the phase of isolated pipelines. Integration is the future — between sensors, between modalities, and between the model and its environment.”