This blog explains how to track objects in a video, such as people or any other kind of object.
Object tracking relies on two main steps, listed below.
i) Object Detection
ii) Object Tracking
To track an object, we must first detect it in a video frame and then keep track of it across the following frames.
Object detection is a computer technology related to computer vision and image processing that deals with detecting instances of semantic objects of a certain class (such as humans, buildings, or cars) in digital images and videos.
Object detection is the process of classifying an object and determining its position in the image. This step must come before object tracking because we need to know where the object is in every frame.
The object detector returns the bounding box coordinates of each object along with the class it belongs to.
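As a rough illustration of what a detector hands over to the tracker, the sketch below models one detection as a small record. The `Detection` class, its corner-based box format, the `score` field, and the sample values are all assumptions made for this example, not the output format of any particular detector.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected object in a single frame (hypothetical structure)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    class_name: str
    score: float  # assumed detector confidence between 0 and 1


# Made-up detections for one frame, only to show the shape of the data
frame_detections = [
    Detection(10, 20, 110, 220, "person", 0.91),
    Detection(300, 150, 420, 260, "car", 0.84),
]

# Keep only the class we want to track
people = [d for d in frame_detections if d.class_name == "person"]
```

A tracker only needs the boxes (and optionally the class) from each frame, so a list of such records per frame is enough as its input.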
Methods for object detection generally fall into either machine learning-based approaches or deep learning-based approaches. For machine learning approaches, features must first be defined using one of the methods below, and a technique such as a support vector machine (SVM) then does the classification. Deep learning techniques, on the other hand, can do end-to-end object detection without explicitly defined features and are typically based on convolutional neural networks (CNNs).
Machine Learning approaches:
- Viola-Jones object detection framework based on Haar features
- Scale-invariant feature transform (SIFT)
- Histogram of oriented gradients (HOG) features
Deep Learning approaches (architectures):
- Region Proposals (R-CNN, Fast R-CNN, Faster R-CNN)
- Single Shot MultiBox Detector (SSD)
- You Only Look Once (YOLO)
Chosen Algorithm for Detection:
There are various frameworks for implementing object detection, such as TensorFlow, Keras, PyTorch, and YOLO. I used the YOLO framework and its own detection model, also called YOLO (You Only Look Once).
I used YOLOv3 instead of YOLOv2 because the authors of the paper report that the YOLOv3 architecture yields more accurate results. Although YOLOv3 has more layers than YOLOv2, it is still fast enough for real-time detection.
The basic concept of YOLOv3 is that it splits the image into an N x N grid. Each grid cell predicts three boxes, and each box carries 4 + 1 + (number of classes) values. The 4 values are the bounding box coordinates x, y, w, h, where x and y are the center coordinates of the box, w is its width, and h is its height. The 1 is the objectness score, i.e. the probability that the box contains an object (this value lies between 0 and 1). The remaining values are the scores for each class.
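To make the arithmetic concrete, here is a small sketch that counts the values predicted per box and per grid cell, and converts a center-format box into corner coordinates. The function names are hypothetical, and the 80-class count and 13 x 13 grid in the usage lines are assumed example values, not figures taken from this post.

```python
def values_per_box(num_classes):
    """4 box coordinates + 1 objectness score + one score per class."""
    return 4 + 1 + num_classes


def values_per_cell(num_classes, boxes_per_cell=3):
    """Each grid cell predicts `boxes_per_cell` boxes."""
    return boxes_per_cell * values_per_box(num_classes)


def center_to_corners(x, y, w, h):
    """Convert (center x, center y, width, height) to (x_min, y_min, x_max, y_max)."""
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)


# Assumed example: 80 classes on a 13 x 13 grid
per_box = values_per_box(80)          # 85
per_cell = values_per_cell(80)        # 255
grid_total = 13 * 13 * per_cell       # 43095

corners = center_to_corners(50, 40, 20, 10)  # (40.0, 35.0, 60.0, 45.0)
```

The corner format is convenient later on, because overlap between two boxes is easier to compute from corners than from centers.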
The architecture uses many residual layers, which let the network grow deeper without compromising performance.
Using this, I detect the objects and then use a tracking algorithm (discussed below) to track the objects of interest.
Blogs for object detection using YOLOv2 and YOLOv3:
1. How to train YOLOv2 to detect custom objects
2. How to train multiple objects in YOLOv2 using your own Dataset
3. How to train YOLOv3 to detect custom objects
Object tracking can be defined as the process of locking on to a moving object and being able to determine if the object is the same as the one present in the previous frame. It is one of several closely related motion analysis tasks:
- Motion detection: Often from a static camera. Common in surveillance systems. Often performed on the pixel level only (due to speed constraints).
- Object localization: Focuses attention on a region of interest in the image. A form of data reduction. Often only interest points are found, which are used later to solve the correspondence problem.
- Motion segmentation: Images are segmented into regions corresponding to different moving objects.
- Three-dimensional shape from motion: Also called structure from motion. Similar problems as in stereo vision.
- Object tracking: A sparse set of features is often tracked, e.g., corner points.
To make this clearer, let's take an example of three cars in a race, where we need to find out which car reaches the finish line first.
For this purpose we need to do three things:
1. Uniquely identify each car by a number or a letter.
2. Maintain that number or letter for each car throughout the race by keeping track of the cars while they move.
3. Identify the car that reaches the finish line first by its number.
The above steps involve the use of Object Tracking.
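The race example can be sketched in a few lines once every car keeps a stable ID. The code below assumes the tracker already produces, for each frame, a mapping from car ID to its position along the track. The function name, the position values, and the finish line at 100 are made up for illustration.

```python
def first_to_finish(frames, finish_x):
    """Return the ID of the first tracked object to reach finish_x, or None."""
    for frame in frames:
        finished = [obj_id for obj_id, x in frame.items() if x >= finish_x]
        if finished:
            # If several objects cross in the same frame, take the one furthest ahead
            return max(finished, key=lambda obj_id: frame[obj_id])
    return None


# Hypothetical tracker output: one dict per frame, car ID -> position on the track
frames = [
    {1: 10, 2: 12, 3: 8},
    {1: 55, 2: 50, 3: 60},
    {1: 101, 2: 95, 3: 99},
]

winner = first_to_finish(frames, finish_x=100)  # car 1
```

This only works if ID 1 refers to the same car in every frame, which is exactly what the tracking step has to guarantee.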
A prerequisite is an object detector that detects the objects of interest.
Tracking can then be done using a variety of methods, such as:
1. Use of the previous frame
2. Use of n-previous frames
With n previous frames, we could store the past positions of each object in a structured format and use them as a reference to determine which unique ID is associated with which object. This solution would work, but it would not scale, because we need a lot of memory to store the data of the previous frames and must process n frames of data to determine each ID.
3. My proposed algorithm
In my algorithm, I use only the immediately previous frame's data, IOU (Intersection over Union), and linear sum assignment.
The algorithm computes the IOU between every combination of objects in the current and previous frames, then uses linear sum assignment to pass each previous object's unique ID to the current object it overlaps most. The chosen pairing is the one with the highest total IOU, which is the same as the least total cost when the cost of a pair is 1 - IOU.
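A minimal sketch of this idea is shown below; it is an illustration under stated assumptions, not the exact implementation behind this post. Boxes are assumed to be (x_min, y_min, x_max, y_max) tuples, and the helper names, the `min_iou` threshold, and the rule that unmatched boxes get fresh IDs are all assumptions. To stay dependency-free, the assignment is solved by brute force over permutations and picks the pairing with the largest total IOU. For more than a handful of objects, `scipy.optimize.linear_sum_assignment` (with `maximize=True`, or with a cost of 1 - IOU) solves the same problem far more efficiently.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def best_assignment(score):
    """Return (row, col) pairs with the largest total score, one use per row/col.

    Brute-force stand-in for a linear sum assignment solver; small inputs only.
    """
    n_rows = len(score)
    n_cols = len(score[0]) if score else 0
    if n_rows == 0 or n_cols == 0:
        return []
    if n_rows <= n_cols:
        cols = max(permutations(range(n_cols), n_rows),
                   key=lambda p: sum(score[r][c] for r, c in enumerate(p)))
        return list(enumerate(cols))
    rows = max(permutations(range(n_rows), n_cols),
               key=lambda p: sum(score[r][c] for c, r in enumerate(p)))
    return sorted((r, c) for c, r in enumerate(rows))


def assign_ids(prev_tracks, curr_boxes, next_id, min_iou=0.3):
    """Carry IDs from the previous frame to the current one.

    prev_tracks: dict of ID -> box from the previous frame.
    Returns (dict of ID -> box for the current frame, next unused ID).
    """
    prev_ids = list(prev_tracks)
    score = [[iou(prev_tracks[i], box) for box in curr_boxes] for i in prev_ids]
    curr_tracks, matched = {}, set()
    for r, c in best_assignment(score):
        if score[r][c] >= min_iou:  # assumed threshold: ignore weak overlaps
            curr_tracks[prev_ids[r]] = curr_boxes[c]
            matched.add(c)
    for c, box in enumerate(curr_boxes):
        if c not in matched:  # unmatched box: treat it as a new object
            curr_tracks[next_id] = box
            next_id += 1
    return curr_tracks, next_id


# Two known objects move slightly; a third object appears
prev = {1: (0, 0, 10, 10), 2: (20, 0, 30, 10)}
curr = [(21, 0, 31, 10), (1, 0, 11, 10), (50, 50, 60, 60)]
tracks, next_id = assign_ids(prev, curr, next_id=3)
# tracks == {1: (1, 0, 11, 10), 2: (21, 0, 31, 10), 3: (50, 50, 60, 60)}
```

Because only the previous frame's boxes are kept, memory use stays proportional to the number of objects in view rather than to the length of the video.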