While there have been huge advances in image and object recognition using deep learning and large language models in recent years, this article describes fast shape-based template matching using classical image processing. The algorithm is based on the following paper, with our own improvements, and the implemented application can be found on GitHub.
Paper:
https://stefan-hinterstoisser.com/papers/hinterstoisser2011linemod.pdf
GitHub:
https://github.com/AkitoArai709/HighSpeedShapeBaseTemplateMattching
Background and algorithm overview
The main methods used to detect objects in images are "template matching" and "statistical methods / deep learning". Each has the following characteristics.
- Template matching: requires no training and runs fast, but it is easily affected by lighting and background changes, and processing on the fly can be difficult.
- Statistical methods / deep learning: provide robust detection that is less affected by lighting and background, but require a large amount of training data and high computing resources.
The cited paper proposes an efficient and robust method for detecting textureless objects that addresses these problems.
In the method described here, only the gradient orientation of the object is used as a feature: template data is created from a model image, and regions similar to the template data are searched for in the test image. Using only the gradient orientation, without considering the norm (magnitude) of the gradient, improves robustness to contrast changes. Similarity is determined by computing the cosine between the gradient orientation of each feature point and that of the template data, which measures their degree of agreement.
However, since the cosine is computationally expensive, the gradient orientations are quantized and a lookup table that returns the evaluation value for each orientation is used to reduce the computational cost, and SSE instructions are used to speed up processing.
Image processing methods
The gradient orientations of the edges are computed from the template image to be searched for, and these orientations are used as features to create the template data. The template data is then scanned across the test image to compute the similarity.
When the template is O and the test image is I, the similarity at position c is calculated with the following formula:

ε(I, O, c) = Σ_{r ∈ P} max_{t ∈ R(c+r)} | cos( ori(O, r) − ori(I, t) ) |   (1)

Here ori denotes the gradient orientation, P is the set of feature locations r in the template, and R(c+r) is the neighborhood of size T centered on position c+r in the input image. The cosine evaluates how well two gradient orientations agree. In other words, for each feature point of the template, the maximum agreement (cosine) over the neighborhood R(c+r) is computed, and these maxima are summed over all feature points to give the similarity.
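As a rough illustration of formula (1), here is a minimal Python sketch that evaluates the similarity naively, without any of the speed-ups described later. The function name and the data layout (a list of feature points and a dict of test-image orientations) are purely illustrative, not what the actual application uses.

```python
import math

def similarity(template_feats, test_ori, c, T):
    """Naive evaluation of similarity formula (1).

    template_feats: list of ((rx, ry), angle) template feature points,
                    with angle the gradient orientation in radians.
    test_ori: dict mapping (x, y) -> gradient orientation in the test image.
    c: (cx, cy) anchor position in the test image.
    T: size of the neighborhood R(c + r) used by the max operator.
    """
    total = 0.0
    for (rx, ry), t_ang in template_feats:
        px, py = c[0] + rx, c[1] + ry
        best = 0.0
        # max over the T-neighborhood around c + r
        for dy in range(-(T // 2), T // 2 + 1):
            for dx in range(-(T // 2), T // 2 + 1):
                ang = test_ori.get((px + dx, py + dy))
                if ang is not None:
                    best = max(best, abs(math.cos(t_ang - ang)))
        total += best
    return total
```

Each feature contributes at most 1, so a perfect match scores `len(template_feats)`; the rest of the article is about making this evaluation fast.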
Calculation of gradient direction
A Sobel filter is used to compute the brightness gradient of the image, from which the gradient orientation is obtained. For RGB color images, the gradient of each color channel is computed separately, and the orientation is taken from the channel whose gradient has the largest magnitude:

ori(x) = orientation of the gradient of channel Ĉ(x),  where Ĉ(x) = argmax_{C ∈ {R,G,B}} ‖ ∂C/∂x ‖
The gradient orientations are quantized into 8 directions so that the response maps can be computed in a later step. To make the result robust to noise, at each location where the gradient norm exceeds a threshold, the most frequent quantized orientation in the 3×3 neighborhood is kept. Let L be the resulting quantized orientation map.
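The per-channel gradient and the 8-direction quantization can be sketched in Python as follows. This is a toy implementation with illustrative names; a real implementation would use an optimized Sobel filter (e.g. OpenCV's), and the gradient-norm threshold and the 3×3 majority filter described above are omitted here.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def gradient(channel, x, y):
    """Sobel gradient (gx, gy) of one image channel at pixel (x, y)."""
    gx = gy = 0
    for j in range(3):
        for i in range(3):
            v = channel[y + j - 1][x + i - 1]
            gx += SOBEL_X[j][i] * v
            gy += SOBEL_Y[j][i] * v
    return gx, gy

def quantized_orientation(channels, x, y, n=8):
    """Orientation at (x, y) of the channel with the largest gradient
    magnitude, quantized into n directions over 180 degrees."""
    gx, gy = max((gradient(ch, x, y) for ch in channels),
                 key=lambda g: g[0] ** 2 + g[1] ** 2)
    angle = math.atan2(gy, gx) % math.pi  # orientation only; sign is ignored
    return int(angle / math.pi * n) % n
```

Note that orientations are taken modulo 180 degrees, since the similarity later uses |cos| and therefore does not distinguish a gradient from its opposite.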
Gradient direction diffusion
To avoid evaluating the cosine and max operators in (1) for every template position in the test image, a new binary representation J of the gradients around each image position is introduced. Together with lookup tables, this binary representation is used to efficiently precompute the result of the max operation.
First, an illustration of the binary representation J of the spread gradient directions is shown below.
The quantized gradient directions are encoded in a binary representation, with one bit assigned to each direction. Each gradient direction is then spread into its neighborhood to build the binary map J.
With this binary representation, J can be computed by shifting the orientation image over the specified range and combining the results with OR operations. The shift range is the neighborhood of size T shown in (2).
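The shift-and-OR construction of J can be sketched as follows. The sparse dict representation and function name are just for illustration; the real implementation shifts whole image rows.

```python
def spread_orientations(quantized, T):
    """Build the binary map J: OR together the orientation bits of every
    pixel within each T x T neighborhood by shifting the whole map.

    quantized: dict (x, y) -> quantized orientation index 0..7
    Returns: dict (x, y) -> 8-bit mask of orientations present nearby.
    """
    bits = {p: 1 << o for p, o in quantized.items()}  # one bit per direction
    J = {}
    for dy in range(T):
        for dx in range(T):
            # shift the whole bit image by (dx, dy) and OR it into J
            for (x, y), b in bits.items():
                key = (x + dx, y + dy)
                J[key] = J.get(key, 0) | b
    return J
```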
Calculation of the response maps
Next, the similarity of the binary representation J to each quantized orientation is precomputed using lookup tables. The precomputed results are called response maps. An illustration of the response maps is shown below.
A response map is built for each quantized orientation. The idea is to compute and store the maximum similarity with respect to a specific orientation (right image), using a lookup table that stores the similarity for each orientation and is indexed by the orientations encoded in J (left image). In this way, one response map is built for each of the 8 quantized orientations.
For reference, example lookup tables for orientations of 0 degrees (→) and 90 degrees (↑) are shown below. When the orientations agree, cos(0) is 1, and the similarity decreases as the orientations deviate from each other.
For each quantized orientation i, the value of the response map S_i at position c is computed as follows, where τ_i denotes the lookup table for orientation i:

S_i(c) = τ_i[ J(c) ] = max over orientations l encoded in J(c) of | cos(i − l) |

The final similarity is then obtained by summing, for each template feature at offset r, the response map of its orientation:

ε(I, O, c) = Σ_{r ∈ P} S_{ori(O, r)}(c + r)
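The lookup tables τ_i and the response maps can be sketched in Python as follows, assuming 8 quantized orientations spaced 180/8 degrees apart (the names are illustrative):

```python
import math

def build_lookup_tables(n=8):
    """tau[i][mask]: max |cos| similarity between orientation i and the
    best orientation encoded in the bit mask (orientations are spaced
    180/n degrees apart)."""
    tau = []
    for i in range(n):
        row = []
        for mask in range(1 << n):
            best = 0.0
            for j in range(n):
                if mask & (1 << j):
                    best = max(best, abs(math.cos((i - j) * math.pi / n)))
            row.append(best)
        tau.append(row)
    return tau

def response_maps(J, tau, n=8):
    """One response map per quantized orientation: S_i(c) = tau[i][J(c)]."""
    return [{c: tau[i][mask] for c, mask in J.items()} for i in range(n)]
```

With the tables precomputed, evaluating S_i(c) is a single table lookup per pixel instead of a cosine and a max over the neighborhood.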
Memory linearization for parallelization
Modern processors do not read single values from main memory one at a time; each access fetches several consecutive bytes at once, called a cache line. If memory is accessed at random locations, cache misses occur and computation slows down.
On the other hand, accessing multiple values from the same cache line is very efficient. Therefore, storing data in the order of access can speed up calculations.
Because the orientations are spread into a neighborhood of size T as described above, it is sufficient to evaluate every T-th pixel, so the response maps are stored in memory in a cache-friendly order, as shown in the figure below.
Each response map is reassembled so that values T pixels apart along the X axis sit next to each other in memory. When the current row is complete, processing continues with the row T pixels further along the Y axis.
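The rearrangement into linear memories can be sketched as follows (row-major Python lists, one memory per (y mod T, x mod T) offset; illustrative only):

```python
def linearize(response, w, h, T):
    """Rearrange one response map (h x w, list of lists) into T*T linear
    memories: memory (ty, tx) stores, in row-major order, the values of
    all pixels with y % T == ty and x % T == tx, so that evaluating
    every T-th pixel reads one memory contiguously."""
    memories = {}
    for ty in range(T):
        for tx in range(T):
            memories[(ty, tx)] = [response[y][x]
                                  for y in range(ty, h, T)
                                  for x in range(tx, w, T)]
    return memories
```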
The linear memories for the template's orientations (black arrows in the figure below) are shifted by an offset computed from each feature's position (Rx, Ry)^T relative to the anchor point (the search reference coordinates) and added together to obtain the similarity between the input image and the template. These additions can be accelerated by performing them with parallel SSE instructions.
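The offset-based accumulation can be sketched as follows. Anchors are assumed to lie on the T-spaced grid in the same row-major order as the linear memories, and boundary wrap-around is ignored for simplicity; the inner per-anchor additions are exactly what the SSE instructions parallelize in the real implementation.

```python
def accumulate(memories, feats, W, T, n_anchors):
    """Sum shifted linear memories to score every anchor position at once.

    memories: per-orientation dict of linear memories, e.g.
              {ori: {(ty, tx): [values in row-major T-grid order]}}.
    feats: list of ((rx, ry), ori) template features relative to the anchor.
    W: image width; anchors lie on the T-spaced grid in row-major order.
    """
    scores = [0.0] * n_anchors
    stride = W // T  # anchors (and memory entries) per image row
    for (rx, ry), ori in feats:
        # which linear memory the feature falls into, and its offset in it
        mem = memories[ori][(ry % T, rx % T)]
        off = (ry // T) * stride + rx // T
        for a in range(n_anchors):  # this loop is SSE-parallelizable
            idx = a + off
            if idx < len(mem):
                scores[a] += mem[idx]
    return scores
```

Each feature thus contributes one contiguous, shifted read of a linear memory, which is what makes the memory accesses cache-friendly.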
The above is what is described in the paper. Below is the implementation verification.
Diffusion in the T neighborhood
As mentioned above, this algorithm spreads features into the T neighborhood and evaluates every T-th pixel to improve computational efficiency.
The detection results for T=8, T=4, and T=1 are shown below.
We found that at T=8 the displacement relative to the object is large, and that the fit improves as T decreases. This is because, when features are spread over a neighborhood of size T, all evaluation values within that neighborhood are identical, so the best-fit position cannot be narrowed down any further.
This is not a problem if a rough position is sufficient, but for precise positioning the size of T must be reduced.
In the application released on GitHub, both high speed and high accuracy are achieved by first performing a coarse search using T-neighborhood diffusion and pyramid processing, followed by a fine search at T=1.
Evaluation using various images
The results were evaluated using a variety of images, including the data set released by the Hashimoto Laboratory at Chukyo University:
http://isl.sist.chukyo-u.ac.jp/Archives/archives.html
Shogi pieces
- Template image size: 149×131
- Test image size: 600×400
- Detection time: 56[ms]
Puzzle
- Template image size: 153×115
- Test image size: 600×400
- Detection time: 27[ms]
T-shaped plate
- Template image size: 107×121
- Test image size: 600×400
- Detection time: 380[ms] (!)
Discussion of evaluation results
While the shogi pieces and the puzzle pieces are matched relatively quickly and with high similarity, the T-shaped plate takes much longer to match.
This is probably because the scene contains many objects with similar feature values: many candidates were found in the initial coarse search, and time was then spent on the fine search. It is also possible that the contours were not detected well because the search target and other objects have similar colors. Adjusting the contour detection sensitivity might raise the detection similarity, but it would also make the method more susceptible to noise.
Source code
The complete project for the application published on GitHub is offered for sale on the following website. If you would like to actually inspect and run the source code of the algorithms presented here, please consider purchasing it.