@cond CUDA_MODULES
Similarity check (PSNR and SSIM) on the GPU {#tutorial_gpu_basics_similarity}
===========================================

@tableofcontents

@todo update this tutorial

@next_tutorial{tutorial_gpu_thrust_interop}

Goal
----

In the @ref tutorial_video_input_psnr_ssim tutorial I already presented the PSNR and SSIM methods for checking
the similarity between two images. As you could see, the execution takes quite some
time, especially in the case of SSIM. However, if the performance numbers of an OpenCV
implementation for the CPU do not satisfy you and you happen to have an NVIDIA CUDA GPU device in
your system, all is not lost. You may try to port or write your own algorithm for the video card.

This tutorial will give you a good grasp of how to approach coding with the GPU module of OpenCV. As
a prerequisite you should already know how to handle the core, highgui and imgproc modules. So, our
main goals are:

- What's different compared to the CPU?
- Create the GPU code for the PSNR and SSIM
- Optimize the code for maximal performance

The source code
---------------

You may also find the source code and the video file in the
`samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity` directory of the OpenCV
source library or download it from [here](https://github.com/opencv/opencv/tree/master/samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity.cpp).
The full source code is quite long (due to the handling of command line
arguments and performance measurement). Therefore, to avoid cluttering these sections,
you'll find here only the functions themselves.

The PSNR returns a float number; if the two inputs are similar, it typically falls between 30 and 50 (higher is
better).

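As a reminder (the derivation is in the @ref tutorial_video_input_psnr_ssim tutorial), for 8-bit images (\f$MAX_I = 255\f$) the PSNR is computed as:

\f[MSE = \frac{1}{c \cdot M \cdot N} \sum{(I_1 - I_2)^2} \qquad PSNR = 10 \cdot \log_{10} \left( \frac{MAX_I^2}{MSE} \right)\f]
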
@snippet samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity.cpp getpsnr
@snippet samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity.cpp getpsnrcuda
@snippet samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity.cpp psnr
@snippet samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity.cpp getpsnropt

The SSIM returns the MSSIM of the images. This too is a floating point number between zero and one (higher is
better); however, we have one for each channel. Therefore, we return a *Scalar* OpenCV data
structure:

@snippet samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity.cpp getssim
@snippet samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity.cpp getssimcuda
@snippet samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity.cpp ssim
@snippet samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity.cpp getssimopt

How to do it? - The GPU
-----------------------

As seen above, we have three types of functions for each operation: one for the CPU and two for
the GPU. The reason I made two for the GPU is to illustrate that simply porting your CPU code to the
GPU will often actually make it slower. If you want some performance gain you will need to remember a few
rules, which I will detail later on.

The GPU module was developed so that it resembles its CPU counterpart as much as
possible. This makes the porting process easier. The first thing you need to do before writing any code is
to link the GPU module to your project and include the header file for the module. All the
functions and data structures of the GPU module are in a *gpu* sub-namespace of the *cv* namespace. You may
add this to the default one via the *using namespace* keyword, or mark it everywhere explicitly via
the cv:: prefix to avoid confusion. I'll do the latter.
@code{.cpp}
#include <opencv2/gpu.hpp> // GPU structures and methods
@endcode

GPU stands for "graphics processing unit". It was originally built to render graphical
scenes. These scenes are built from a lot of data; however, the items are mostly independent of
one another rather than sequentially dependent, so they can be processed in parallel. Because of this, a
GPU contains multiple smaller processing units. These aren't state-of-the-art processors, and
in a one-on-one test against a CPU core each will fall behind. However, their strength lies in their numbers. In
recent years there has been an increasing trend to harness these massively parallel powers of the
GPU for non-graphical computations as well. This gave birth to general-purpose computation on
graphics processing units (GPGPU).

The GPU has its own memory. When you read data from the hard drive with OpenCV into a *Mat* object,
it resides in your system memory. The CPU can work on it more or less directly (via its cache);
the GPU, however, cannot. It has to transfer the information required for calculations from the
system memory to its own. This is done via an upload process and is time consuming. In the end the result
will have to be downloaded back to your system memory for your CPU to see and use it. Porting
small functions to the GPU is not recommended, as the upload/download time will be larger than what
you gain from parallel execution.

Mat objects are stored only in the system memory (or the CPU cache). To get an OpenCV matrix to
the GPU you'll need to use its GPU counterpart, @ref cv::cuda::GpuMat. It works similarly to the Mat, with the
limitations that it is 2D only and that its functions return no references (you cannot mix GPU references with CPU
ones). To upload a Mat object to the GPU you need to call the upload function after creating an
instance of the class. To download you may use simple assignment to a Mat object or the download
function.
@code{.cpp}
Mat I1;          // Main memory item - read an image into it with imread, for example
gpu::GpuMat gI1; // GPU matrix - for now empty
gI1.upload(I1);  // Upload data from the system memory to the GPU memory

I1 = gI1;        // Download; gI1.download(I1) will work too
@endcode
Once you have your data up in the GPU memory you may call GPU-enabled functions of OpenCV. Most of
the functions keep the same name as on the CPU, with the difference that they only accept
*GpuMat* inputs.

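For example, a minimal sketch, assuming gI1 and a second matrix gI2 uploaded the same way as above
(gpu::absdiff is the GPU twin of the cv::absdiff call used by the PSNR code):
@code{.cpp}
gpu::GpuMat gDiff;
gpu::absdiff(gI1, gI2, gDiff); // same name as on the CPU, but GpuMat in, GpuMat out
@endcode
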
Another thing to keep in mind is that you cannot write efficient algorithms for every channel count
on the GPU. Generally, I found that the input images for the GPU need to have either one or
four channels, with char or float element types. No double support on the
GPU, sorry. Passing other types of objects to some functions will result in an exception being thrown
and an error message on the error output. In most places the documentation details the types
accepted for the inputs. If you have three channel images as an input you can do two things: either
add a new channel (and use char elements) or split up the image and call the function for each
channel image. The first one isn't really recommended, as it wastes memory.

For some functions, where the position of the elements (neighbor items) doesn't matter, the quick
solution is to just reshape the image into a single channel one. This is the case for the PSNR
implementation, where for the *absdiff* method the value of the neighbors is not important. However,
for *GaussianBlur* this isn't an option, and so we need to use the split method for the SSIM.
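A minimal sketch of the two approaches, assuming gI is an uploaded three channel (CV_8UC3) GpuMat:
@code{.cpp}
gpu::GpuMat flat = gI.reshape(1); // view the same data as one channel with 3x the columns;
                                  // fine for element-wise operations such as absdiff

std::vector<gpu::GpuMat> planes;
gpu::split(gI, planes);           // separate B, G and R planes; needed when neighbors matter,
                                  // e.g. for GaussianBlur in the SSIM
@endcode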
With this knowledge you can make GPU-viable code (like my GPU version) and run it. You'll be
surprised to see that it might turn out slower than your CPU implementation.

Optimization
------------

The reason for this is that you're throwing the cost of memory allocation and data
transfer out the window. And on the GPU this cost is damn high. Another possibility for optimization is to
introduce asynchronous OpenCV GPU calls with the help of @ref cv::cuda::Stream.

-#  Memory allocation on the GPU is considerably expensive. Therefore, if possible, allocate new memory as
    few times as possible. If you create a function that you intend to call multiple times, it is a
    good idea to allocate any local parameters for the function only once, during the first call. To
    do this you create a data structure containing all the local variables you will use. For
    instance, in case of the PSNR these are:
    @code{.cpp}
    struct BufferPSNR // Optimized GPU versions
    {   // Data allocations are very expensive on GPU. Use a buffer to solve: allocate once, reuse later.
        gpu::GpuMat gI1, gI2, gs, t1, t2;

        gpu::GpuMat buf;
    };
    @endcode
    Then create an instance of this in the main program:
    @code{.cpp}
    BufferPSNR bufferPSNR;
    @endcode
    And finally pass this to the function each time you call it:
    @code{.cpp}
    double getPSNR_GPU_optimized(const Mat& I1, const Mat& I2, BufferPSNR& b)
    @endcode
    Now you access these local parameters as *b.gI1*, *b.buf* and so on. The GpuMat will only
    reallocate itself on a new call if the new matrix size differs from the previous one.
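    As an illustration, here is a hedged sketch of such a function body. It is not the sample's
    exact code; it assumes 8-bit input images and uses only the buffer fields declared above:
    @code{.cpp}
    double getPSNR_GPU_optimized(const Mat& I1, const Mat& I2, BufferPSNR& b)
    {
        b.gI1.upload(I1);                      // reuses the previous allocation if sizes match
        b.gI2.upload(I2);

        b.gI1.convertTo(b.t1, CV_32F);         // work in float so squaring cannot overflow
        b.gI2.convertTo(b.t2, CV_32F);

        gpu::absdiff(b.t1.reshape(1), b.t2.reshape(1), b.gs);
        gpu::multiply(b.gs, b.gs, b.gs);       // |I1 - I2|^2, computed in place

        double sse = gpu::sum(b.gs, b.buf)[0]; // b.buf holds the reduction workspace

        if (sse <= 1e-10) return 0;            // identical images
        double mse = sse / (double)(I1.channels() * I1.total());
        return 10.0 * log10((255.0 * 255.0) / mse);
    }
    @endcode
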
-#  Avoid unnecessary data transfers. Any small data transfer will be significant once
    you go to the GPU. Therefore, if possible, make all calculations in place (in other words, do not
    create new memory objects, for the reasons explained in the previous point). For example, although
    arithmetic operations may be easier to express as one-line formulas, they will be
    slower. In case of the SSIM, at one point I need to calculate:
    @code{.cpp}
    b.t1 = 2 * b.mu1_mu2 + C1;
    @endcode
    Although the above call will succeed, observe that there is a hidden temporary allocation present.
    Before it performs the addition it needs to store the result of the multiplication somewhere. Therefore, it will
    create a local matrix in the background, add the *C1* value to that, and finally assign the result to
    *t1*. To avoid this we use the gpu functions instead of the arithmetic operators:
    @code{.cpp}
    gpu::multiply(b.mu1_mu2, 2, b.t1); // b.t1 = 2 * b.mu1_mu2
    gpu::add(b.t1, C1, b.t1);          // b.t1 = b.t1 + C1, in place, no hidden temporary
    @endcode
-#  Use asynchronous calls (the @ref cv::cuda::Stream). By default, whenever you call a GPU function
    it will wait for the call to finish and return with the result afterwards. However, it is
    possible to make asynchronous calls, meaning the call queues the operation for execution, makes the
    costly data allocations for the algorithm, and returns right away. Now you can call another
    function, if you wish. For the MSSIM this is a small optimization point. In our default
    implementation we split the image into channels and call the GPU functions for each channel. A
    small degree of parallelization is possible with the stream. By using a stream we can perform the
    data allocation and upload operations while the GPU is already executing a given
    method. For example, we need to upload two images. We queue these one after another and call
    the function that processes them. The functions will wait for the upload to finish;
    however, while this happens, the output buffer allocations are made for the function to be executed
    next.
    @code{.cpp}
    gpu::Stream stream;

    stream.enqueueConvert(b.gI1, b.t1, CV_32F);        // Asynchronous convert, queued on the stream

    gpu::split(b.t1, b.vI1, stream);                   // Methods (pass the stream as final parameter).
    gpu::multiply(b.vI1[i], b.vI1[i], b.I1_2, stream); // I1^2
    @endcode

Result and conclusion
---------------------

On an Intel P8700 laptop CPU paired with a low-end NVIDIA GT220M, here are the performance numbers:
@code
Time of PSNR CPU (averaged for 10 runs): 41.4122 milliseconds. With result of: 19.2506
Time of PSNR GPU (averaged for 10 runs): 158.977 milliseconds. With result of: 19.2506
Initial call GPU optimized: 31.3418 milliseconds. With result of: 19.2506
Time of PSNR GPU OPTIMIZED ( / 10 runs): 24.8171 milliseconds. With result of: 19.2506

Time of MSSIM CPU (averaged for 10 runs): 484.343 milliseconds. With result of B0.890964 G0.903845 R0.936934
Time of MSSIM GPU (averaged for 10 runs): 745.105 milliseconds. With result of B0.89922 G0.909051 R0.968223
Time of MSSIM GPU Initial Call 357.746 milliseconds. With result of B0.890964 G0.903845 R0.936934
Time of MSSIM GPU OPTIMIZED ( / 10 runs): 203.091 milliseconds. With result of B0.890964 G0.903845 R0.936934
@endcode
In both cases we managed a performance increase of almost 100% compared to the CPU implementation.
It may be just the improvement needed for your application to work. You may observe a runtime
instance of this on [YouTube here](https://www.youtube.com/watch?v=3_ESXmFlnvY).

@youtube{3_ESXmFlnvY}
@endcond

@cond CUDA_MODULES
Using a cv::cuda::GpuMat with thrust {#tutorial_gpu_thrust_interop}
===========================================

@tableofcontents

@prev_tutorial{tutorial_gpu_basics_similarity}

Goal
----

Thrust is an extremely powerful library for various CUDA-accelerated algorithms. However, thrust is designed
to work with vectors, not pitched matrices. The following tutorial will discuss wrapping cv::cuda::GpuMat objects
into thrust iterators that can be used with thrust algorithms.

This tutorial should show you how to:
- Wrap a GpuMat into a thrust iterator
- Fill a GpuMat with random numbers
- Sort a column of a GpuMat in place
- Copy values greater than 0 to a new gpu matrix
- Use streams with thrust

Wrapping a GpuMat into a thrust iterator
----

The following code will produce an iterator for a GpuMat:

@snippet samples/cpp/tutorial_code/gpu/gpu-thrust-interop/Thrust_interop.hpp begin_itr
@snippet samples/cpp/tutorial_code/gpu/gpu-thrust-interop/Thrust_interop.hpp end_itr

Our goal is to have an iterator that will start at the beginning of the matrix, and increment correctly to access continuous matrix elements. This is trivial for a continuous row, but how about for a column of a pitched matrix? To do this we need the iterator to be aware of the matrix dimensions and step. This information is embedded in the step_functor.
@snippet samples/cpp/tutorial_code/gpu/gpu-thrust-interop/Thrust_interop.hpp step_functor
The step functor takes in an index value and returns the appropriate
offset from the beginning of the matrix. The counting iterator simply increments over the range of pixel elements. Combined into the transform_iterator, we have an iterator that counts from 0 to M*N and correctly
increments to account for the pitched memory of a GpuMat. Unfortunately this does not include any memory location information; for that we need a thrust::device_ptr. By combining a device pointer with the transform_iterator we can point thrust to the first element of our matrix and have it step accordingly.

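A hedged sketch of how these pieces combine, assuming a single-channel CV_32F GpuMat `mat` and
that the step_functor from the header above takes the column count, the row step in elements, and
the channel count:

@code{.cpp}
// Point thrust at the first element of the matrix...
thrust::device_ptr<float> ptr = thrust::device_pointer_cast(mat.ptr<float>(0));

// ...and feed it indices 0..rows*cols-1 remapped to pitch-aware memory offsets.
auto offsets = thrust::make_transform_iterator(thrust::make_counting_iterator(0),
    step_functor<float>(mat.cols, (int)(mat.step / sizeof(float)), mat.channels()));
auto begin = thrust::make_permutation_iterator(ptr, offsets);
// begin now behaves as a random-access iterator over the matrix elements.
@endcode
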
Fill a GpuMat with random numbers
----
Now that we have some nice functions for making iterators for thrust, let's use them to do some things OpenCV can't do. Unfortunately, at the time of this writing, OpenCV doesn't have any GPU random number generation.
Thankfully thrust does, and it's now trivial to interop between the two.
Example taken from http://stackoverflow.com/questions/12614164/generating-a-random-number-vector-between-0-and-1-0-using-thrust

First we need to write a functor that will produce our random values.
@snippet samples/cpp/tutorial_code/gpu/gpu-thrust-interop/main.cu prg

This will take in an integer value and output a value between a and b.
Now we will populate our matrix with values between 0 and 10 with a thrust transform.
@snippet samples/cpp/tutorial_code/gpu/gpu-thrust-interop/main.cu random

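A hedged sketch of what that transform can look like, assuming the iterator helpers shown earlier,
a CV_32F GpuMat `d_mat`, and that the `prg` functor from the snippet is constructed from the lower
and upper bound:

@code{.cpp}
thrust::transform(thrust::make_counting_iterator(0),
                  thrust::make_counting_iterator(d_mat.rows * d_mat.cols),
                  GpuMatBeginItr<float>(d_mat),  // write results straight into the matrix
                  prg(0.0f, 10.0f));             // map each index to a random float in [0, 10]
@endcode
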
Sort a column of a GpuMat in place
----

Let's fill the matrix elements with random values and an index. Afterwards we will sort the random numbers and the indices.
@snippet samples/cpp/tutorial_code/gpu/gpu-thrust-interop/main.cu sort

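The heart of that snippet is a key-value sort. A hedged sketch, assuming a single-column CV_32F
matrix `d_data` holding the values and a matching CV_32S matrix `d_idx` holding the indices:

@code{.cpp}
thrust::sort_by_key(GpuMatBeginItr<float>(d_data), GpuMatEndItr<float>(d_data),
                    GpuMatBeginItr<int>(d_idx)); // the indices are permuted along with the values
@endcode
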
Copy values greater than 0 to a new gpu matrix while using streams
----
In this example we're going to see how cv::cuda::Stream can be used with thrust. Unfortunately, this specific example uses functions that must return results to the CPU, so it isn't the optimal use of streams.

@snippet samples/cpp/tutorial_code/gpu/gpu-thrust-interop/main.cu copy_greater

First we will populate a GPU mat with randomly generated data between -1 and 1 on a stream.

@snippet samples/cpp/tutorial_code/gpu/gpu-thrust-interop/main.cu random_gen_stream

Notice the use of thrust::system::cuda::par.on(...); this creates an execution policy for executing thrust code on a stream.
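A hedged sketch of the execution-policy plumbing, reusing the `d_mat` and `prg` assumptions from
above (cv::cuda::StreamAccessor lives in opencv2/core/cuda_stream_accessor.hpp):

@code{.cpp}
cv::cuda::Stream stream;
cudaStream_t s = cv::cuda::StreamAccessor::getStream(stream); // raw CUDA stream for thrust

thrust::transform(thrust::system::cuda::par.on(s),            // run on our stream, not the default one
                  thrust::make_counting_iterator(0),
                  thrust::make_counting_iterator(d_mat.rows * d_mat.cols),
                  GpuMatBeginItr<float>(d_mat), prg(-1.0f, 1.0f));
@endcode
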
There is a bug in the version of thrust distributed with the CUDA toolkit; as of version 7.5 this has not been fixed. This bug causes code to not execute on streams.
The bug can, however, be fixed by using the newest version of thrust from the git repository (http://github.com/thrust/thrust.git).
Next we will determine how many values are greater than 0 by using thrust::count_if with the following predicate:

@snippet samples/cpp/tutorial_code/gpu/gpu-thrust-interop/main.cu pred_greater

We will use those results to create an output buffer for storing the copied values; we will then use copy_if with the same predicate to populate the output buffer.
Lastly, we will download the values into a CPU mat for viewing.
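A hedged sketch of those last steps; `pred_greater_than` stands in for the predicate from the
snippet above (its actual name in the sample may differ):

@code{.cpp}
int count = thrust::count_if(thrust::system::cuda::par.on(s),
                             GpuMatBeginItr<float>(d_mat), GpuMatEndItr<float>(d_mat),
                             pred_greater_than<float>(0.0f));

cv::cuda::GpuMat d_out(1, count, CV_32F);        // exactly-sized output buffer
thrust::copy_if(thrust::system::cuda::par.on(s),
                GpuMatBeginItr<float>(d_mat), GpuMatEndItr<float>(d_mat),
                GpuMatBeginItr<float>(d_out),
                pred_greater_than<float>(0.0f));

cv::Mat h_out(d_out);                            // download to a CPU mat for viewing
@endcode
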
@endcond

@cond CUDA_MODULES
GPU-Accelerated Computer Vision (cuda module) {#tutorial_table_of_content_gpu}
=============================================

Squeeze out every little bit of computing power from your system by using the power of your video card to
run the OpenCV algorithms.

- @subpage tutorial_gpu_basics_similarity

    *Languages:* C++

    *Compatibility:* \> OpenCV 2.0

    *Author:* Bernát Gábor

    This will give a good grasp on how to approach coding with the GPU module, once you already know
    how to handle the other modules. As a test case it will port the similarity methods from the
    tutorial @ref tutorial_video_input_psnr_ssim to the GPU.

- @subpage tutorial_gpu_thrust_interop

    *Languages:* C++

    *Compatibility:* \>= OpenCV 3.0

    This tutorial will show you how to wrap a GpuMat into a thrust iterator in order to be able to
    use the functions in the thrust library.

@endcond