Can your object detector detect people and horses in the following image?
What if the same image is rotated by 90 degrees? Can it detect people and horses?
Or a cat in these images?
We have come a long way in the advancement of computer vision. Object detection algorithms using AI have outperformed humans in certain tasks. But why is it still a challenge to detect a person when the image is rotated 90 degrees, a cat when it is lying in an uncommon position, or an object when only part of it is visible?
Many models have been created for object detection and classification since AlexNet in 2012, and they keep improving in both accuracy and efficiency. However, most of these models are trained and tested in ideal scenarios. In reality, the conditions under which these models are used are not always ideal: the background may be cluttered, and the object may be deformed or occluded. Take the images of the cat below. Any object detector trained to detect a cat will, without fail, detect the cat in the image on the left. But for the image on the right, most detectors may fail to detect the cat.
Tasks that are trivial for humans remain a challenge in computer vision. It is easy for us to identify a person regardless of the image's orientation, a cat in different poses, or a cup viewed from any angle.
Let’s take a look at 6 such obstacles to detecting objects robustly.
1. Viewpoint variation
An object viewed from different angles may look completely different. Take the simple example of a cup (see the images below): the first image, a top view of a cup of black coffee, looks completely different from the second, which shows the side and top of a cup of cappuccino, and from the third, a side view of the cup.
This is one of the challenges with object detection because most detectors are trained with images only from a particular viewpoint.
2. Deformation
Many objects of interest are not rigid bodies and can be deformed in extreme ways. As an example, look at the images below of yogis in different positions. If an object detector was trained to detect a person using only images of people sitting, standing, or walking, it might not detect the people in these images, because their features may not match the ones it learned about people during training.
3. Occlusion
The objects of interest can be occluded. Sometimes only a small portion of an object, as little as a few pixels, may be visible.
For example, in the image above, the object (a cup) is occluded by the person holding it. When we see only part of an object, in most cases we can instantly identify what it is. An object detector, however, often cannot.
Another example of occlusion is images in which a person is holding a mobile phone. Detecting the phone in such images is a challenge:
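One common way to make detectors more tolerant of occlusion is to simulate it during training by masking out part of each image (the idea behind "random erasing" or "cutout" augmentation). Below is a minimal sketch of that idea using NumPy; the `occlude` helper and the random stand-in image are illustrative assumptions, not a production augmentation pipeline.

```python
import numpy as np

rng = np.random.default_rng(42)

def occlude(img, top, left, height, width, fill=0):
    """Black out a rectangular region to mimic a partially hidden object.

    Training on artificially occluded images (random-erasing / cutout style
    augmentation) can make a detector less brittle when real objects are
    partially hidden. This helper is a simplified illustration.
    """
    out = img.copy()
    out[top:top + height, left:left + width] = fill
    return out

# A stand-in 8x8 grayscale "image"; a real pipeline would use photos.
image = rng.integers(0, 256, size=(8, 8)).astype(np.uint8)

# Hide the central 4x4 region, as a hand might hide the middle of a phone.
occluded = occlude(image, top=2, left=2, height=4, width=4)
```

In practice you would randomize the position and size of the erased region for each training sample rather than fixing them as above.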
4. Illumination conditions
The effects of illumination are drastic on the pixel level. Objects exhibit different colors under different illumination conditions. For example, an outdoor surveillance camera is exposed to different lighting conditions throughout the day, including bright daylight, evening, and night light. An image of a pedestrian looks different in these varying illuminations. This affects the capability of the detector to detect objects robustly.
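To see how drastic the pixel-level effect can be, consider the toy example below: the same grayscale patch under a simple global dimming. The specific patch values and the 0.5 scale factor are made up for illustration; real illumination changes also involve color shifts and shadows, not just uniform brightness.

```python
import numpy as np

# A hypothetical 2x2 grayscale patch from a daytime pedestrian image (0-255).
daylight = np.array([[200, 180],
                     [190, 170]], dtype=np.float32)

# Crudely simulate dimmer evening light by halving intensity,
# clipping to the valid pixel range.
evening = np.clip(daylight * 0.5, 0, 255)

# The scene is unchanged, but every pixel value the detector sees
# is now very different from the daylight version.
mean_shift = float(np.abs(daylight - evening).mean())
```

A detector that has only seen the daylight distribution of pixel values may score the evening patch very differently, which is why varied lighting in the training set matters.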
5. Cluttered or textured background
The objects of interest may blend into the background, making them hard to identify. For example, the cat and dog in the images below are camouflaged against the rugs they are sitting or lying on. In such cases, object detectors face challenges detecting them.
6. Intra-class variation
An object of interest can often be relatively broad, such as a house. There are many different types of these objects, each with its own appearance. All the images below are of different types of houses.
A good detector must be robust enough to detect the cross-product of all these variations, while also maintaining sensitivity to the inter-class variations.
To create a robust object detector, ensure that the training data has good variation: different viewpoints, different illumination conditions, and objects against different backgrounds. If you cannot find real-world training data with all these variations, use data augmentation techniques to synthesize the data you need.
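As a concrete sketch, here are three basic augmentations (rotation, horizontal flip, and brightness adjustment) implemented with plain NumPy. The helper names and the random stand-in image are assumptions for illustration; in practice you would typically use a library such as torchvision or Albumentations, which also transform the bounding boxes alongside the image.

```python
import numpy as np

rng = np.random.default_rng(0)

# A stand-in 4x4 RGB "image" (values in [0, 255]); a real pipeline loads photos.
image = rng.integers(0, 256, size=(4, 4, 3)).astype(np.uint8)

def rotate90(img):
    """Rotate the image 90 degrees counter-clockwise (viewpoint variation)."""
    return np.rot90(img)

def hflip(img):
    """Mirror the image horizontally (viewpoint variation)."""
    return img[:, ::-1]

def adjust_brightness(img, factor):
    """Scale pixel intensities, clipping to [0, 255] (illumination variation)."""
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

# Each original training image yields several synthetic variants.
augmented = [rotate90(image), hflip(image), adjust_brightness(image, 1.5)]
```

Applying a random subset of such transforms to every training image exposes the detector to viewpoints and lighting conditions it would otherwise never see.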