Developer's guide
Some useful resources to get started with digital color management, editing pipelines, calibration, view transforms, etc.:
- https://www.visualeffectssociety.com/sites/default/files/files/cinematic_color_ves.pdf
- https://acescentral.com/
- http://last.hit.bme.hu/download/firtha/video/Colorimetry/Fairchild_M._Color_appearance_models__2005.pdf
Pixels are essentially 4D RGBA vectors. Since the early 2000s, processors have been able to process vectors by applying a Single Instruction to Multiple Data (SIMD). This speeds up computations by processing from 1 pixel (SSE2) to 4 pixels (AVX-512) at the same time, saving a lot of CPU cycles.
darktable has 3 versions of its IOPs: pure C (scalar), SSE2 (vectorized for 4 floats) and OpenCL (vectorized on GPU). This leads to some redundancy in the code. However, modern compilers and OpenMP offer auto-vectorization options that can optimize pure C, provided the code is written in a vectorizable way and uses some pragmas to give hints to the compiler.
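As a minimal sketch of what such vectorizable pure C can look like, the function below multiplies every float of an RGBA buffer by a constant. The name apply_gain and its signature are hypothetical and not part of darktable. The loop has no branches and no dependency between iterations, so a compiler with auto-vectorization enabled may turn it into SIMD instructions; the OpenMP pragma is only a hint and is skipped when OpenMP is disabled.

```c
#include <stddef.h>

/* Hypothetical example, not darktable code: scale every channel of an RGBA buffer. */
void apply_gain(const float *const restrict in, float *const restrict out,
                const size_t width, const size_t height, const float gain)
{
  const size_t n = width * height * 4; /* 4 floats (RGBA) per pixel */

#ifdef _OPENMP
#pragma omp simd
#endif
  for(size_t k = 0; k < n; ++k)
    out[k] = in[k] * gain;
}
```

Whether the loop is actually vectorized depends on the compiler and its flags; with GCC, for instance, the -fopt-info-vec option reports which loops were vectorized.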
Writing vectorizable code: https://info.ornl.gov/sites/publications/files/Pub69214.pdf
Best practices for auto-vectorization:
- avoid branches in loops that change the control flow. Use inline statements like
absolute = (x > 0) ? x : -x;
so they can be converted to byte masks in SIMD,
- pixels should only be referenced from the base pointer of their array and the indices of the loops, not from implicit pointer increments. For example:
float *image = (float *)in;
for(size_t i= 0; i < height; ++i)
{
float *pixel = (float *)image + i * width;
for(size_t j = 0; j < width; ++j)
{
*pixel = whatever;
pixel++;
}
}
should be written:
float *const restrict image = (float *)in;
for(size_t i = 0; i < height; ++i)
{
for(size_t j = 0; j < width; ++j)
{
image[i * width + j] = whatever;
}
}
In the former, the address pointed to by pixel depends on the previous loop iteration, which prevents parallelization and vectorization, and also makes the code more difficult to read. The latter uses an indexing that only depends on the i and j loop increments, avoids false aliasing, and is easier to read (we immediately spot the array indexing).
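The two guidelines above can be combined in one loop. The sketch below is only an illustration (clamp_unit is a hypothetical function, not darktable code): it clamps a single-channel image to [0, 1] using ternary expressions instead of if/else, and addresses every value from the base pointers and the loop indices only.

```c
#include <stddef.h>

/* Hypothetical example: clamp every value of a single-channel image to [0, 1]. */
void clamp_unit(const float *const restrict in, float *const restrict out,
                const size_t width, const size_t height)
{
  for(size_t i = 0; i < height; ++i)
  {
    for(size_t j = 0; j < width; ++j)
    {
      const size_t k = i * width + j; /* index derived only from i and j */
      const float x = in[k];
      /* ternaries instead of if/else: they can be lowered to min/max or mask operations */
      const float lo = (x < 0.0f) ? 0.0f : x;
      out[k] = (lo > 1.0f) ? 1.0f : lo;
    }
  }
}
```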
- avoid carrying struct arguments in functions called in loops, and unpack the struct members before the loop. Vectorization can't be performed on structures, but only on float and int scalars and arrays. For example:
typedef struct iop_data_t
{
float pixel[4];
float factor;
} iop_data_t;
float foo(float x, const struct iop_data_t *bar)
{
return bar->factor * (x + bar->pixel[0] + bar->pixel[1] + bar->pixel[2] + bar->pixel[3]);
}
void loop(const float *in, float *out, const size_t width, const size_t height, const struct iop_data_t *bar)
{
for(size_t k = 0; k < height * width; ++k)
{
out[k] = foo(in[k], bar); // the non-vectorized function will be called at each iteration (expensive)
}
}
should be written:
typedef struct iop_data_t
{
float pixel[4];
float factor;
} iop_data_t;
#ifdef _OPENMP
#pragma omp declare simd
#endif
/* declare the function vectorizable and inline it to avoid calls from within the loop */
static inline float foo(const float x, const float pixel[4], const float factor)
{
float sum = x;
/* use a SIMD reduction to vectorize the sum */
#ifdef _OPENMP
#pragma omp simd aligned(pixel:16) reduction(+:sum)
#endif
for(size_t k = 0; k < 4; ++k)
sum += pixel[k];
return factor * sum;
}
void loop(const float *const restrict in, // use constant pointers and restrict keyword to avoid false-aliasing
float *const restrict out,
const size_t width, const size_t height, const struct iop_data_t *const bar)
{
/* unpack the struct members */
const float *const restrict pixel = bar->pixel;
const float factor = bar->factor;
#ifdef _OPENMP
#pragma omp parallel for simd default(none) \
dt_omp_firstprivate(in, out, pixel, factor, width, height) \
schedule(simd:static) aligned(in, out:64)
#endif
for(size_t k = 0; k < height * width; ++k)
{
/*
* now the code of the function foo is copied inside the loop
* so we avoid functions calls
* and the compiler can vectorize the content of foo at the loop level
* for example, on AVX2 platforms, the compiler could optimize the function
* to process 8 elements of out and in (one 256-bit register of floats) at every loop step to save cycles.
*/
out[k] = foo(in[k], pixel, factor);
}
}
- if you use nested loops (e.g. loop on the width and height of the array), declare the pixel pointers in the innermost loop and use collapse(2) in the OpenMP pragma so the compiler will be able to optimize the cache/memory use and split the loop more evenly between the different threads,
- use flat indexing of arrays whenever possible ( for(size_t k = 0 ; k < ch * width * height ; k += ch) ) instead of nested width/height/channels loops,
- use the restrict keyword on image/pixels pointers to avoid aliasing and avoid in-place operations on pixels (*out must always be different from *in) so you don't trigger variable dependencies between threads,
- align arrays on 64 bytes and pixels on 16 bytes blocks so the memory is contiguous and the CPU can load full cache lines (and avoid segfaults),
- write small functions and optimize locally (one loop/function), using OpenMP and/or compiler pragmas,
- keep your code stupid simple, systematic and avoid smart-ass pointer arithmetic because it will only lead the compiler to detect variable dependencies and pointer aliasing where there are none,
- avoid type casts,
- declare input/output pointers as *const and variables as const to avoid false-sharing in parallel loops (using shared(variable) OpenMP pragma),
- look at the RawTherapee source code because these guys got it right.
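The advice above on nested loops, restrict pointers and alignment can be sketched as follows. Both functions are hypothetical illustrations rather than darktable code, and the allocation helper assumes a C11 runtime that provides aligned_alloc. The pixel pointers are declared in the innermost loop, computed from the base pointers and the loop indices, and the two loops are perfectly nested so that collapse(2) applies.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: allocate a 4-channel float image aligned on 64 bytes. */
float *alloc_image(const size_t width, const size_t height)
{
  const size_t bytes = width * height * 4 * sizeof(float);
  /* aligned_alloc (C11) expects a size that is a multiple of the alignment */
  const size_t padded = (bytes + 63) / 64 * 64;
  return aligned_alloc(64, padded);
}

/* Hypothetical kernel: scale R, G and B by a factor and copy alpha unchanged. */
void scale_rgb(const float *const restrict in, float *const restrict out,
               const size_t width, const size_t height, const float factor)
{
#ifdef _OPENMP
#pragma omp parallel for collapse(2) schedule(static)
#endif
  for(size_t i = 0; i < height; ++i)
    for(size_t j = 0; j < width; ++j)
    {
      /* pixel pointers live in the innermost loop and depend only on i and j */
      const float *const restrict pix_in = in + (i * width + j) * 4;
      float *const restrict pix_out = out + (i * width + j) * 4;
      for(size_t c = 0; c < 3; ++c) pix_out[c] = pix_in[c] * factor;
      pix_out[3] = pix_in[3];
    }
}
```

The caller is expected to pass two distinct buffers obtained from alloc_image() and to release them with free(); passing the same buffer as in and out would violate the restrict contract.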
Modules are the interfaces for IOPs, i.e. image-processing filters stacked in the pixelpipe. IOPs can be found in src/iop and the IOP API can be found in the header src/iop/iop_api.h.
Most IOPs have 3 variants of their pixel-filtering part:
- a pure C implementation, in process(),
- a C optimized version, with SSE2 intrinsics, in process_sse2(),
- an OpenCL version, offloading the computation to the GPU, in process_opencl().
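To give an idea of the redundancy between the first two variants, here is the same toy kernel written once in plain C and once with intrinsics. This is only a sketch: the real process() and process_sse2() take darktable-specific arguments declared in src/iop/iop_api.h, whereas the names scale_plain and scale_sse2 and their simplified signatures are assumptions made for illustration. The intrinsics variant is compiled only when the compiler targets SSE2, and it uses unaligned loads and stores so that it does not depend on how the buffers were allocated.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <xmmintrin.h>
#endif

/* Hypothetical kernel, plain C variant: multiply each RGBA pixel by a factor. */
void scale_plain(const float *const restrict in, float *const restrict out,
                 const size_t npixels, const float factor)
{
  for(size_t k = 0; k < 4 * npixels; ++k) out[k] = in[k] * factor;
}

#if defined(__SSE2__)
/* Same kernel with intrinsics: one 128-bit register holds the 4 floats of a pixel. */
void scale_sse2(const float *const restrict in, float *const restrict out,
                const size_t npixels, const float factor)
{
  const __m128 f = _mm_set1_ps(factor);
  for(size_t k = 0; k < 4 * npixels; k += 4)
    _mm_storeu_ps(out + k, _mm_mul_ps(_mm_loadu_ps(in + k), f));
}
#endif
```

Both functions should produce identical output; the plain C variant leaves the choice of instructions to the compiler, which is the approach the vectorization guidelines above aim for.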
An example of a dummy IOP can be found in src/iop/useless.c and used as a boilerplate.
If you add a new IOP, be sure to add the C file in src/iop/CMakeLists.txt#L69 and to set its priority in the pixelpipe by adding a new node in tools/iop_dependencies.py.
darktable wiki is licensed under the Creative Commons BY-SA 4.0 terms.