vello/piet-gpu/shader/backdrop.comp

// SPDX-License-Identifier: Apache-2.0 OR MIT OR Unlicense

// Propagation of tile backdrop for filling.
//
// Each thread reads one path element and computes the number of tile rows it
// spans, based on the bounding box.
// In a further compaction step, the workgroup loops over the tile rows of all
// its elements in parallel, one thread per row.
// For each row, the per-tile backdrop (as calculated in the previous coarse
// path segment kernel) is read and propagated from left to right (prefix summed).
//
// Output state:
//  - Each path element has an array of tiles covering the whole path, based on its bounding box
//  - Each tile per path element contains the 'backdrop' and a list of subdivided path segments

#version 450
#extension GL_GOOGLE_include_directive : enable

#include "mem.h"
#include "setup.h"

#define LG_BACKDROP_WG (7 + LG_WG_FACTOR)
#define BACKDROP_WG (1 << LG_BACKDROP_WG)

layout(local_size_x = BACKDROP_WG, local_size_y = 1) in;

layout(set = 0, binding = 1) readonly buffer ConfigBuf {
    Config conf;
};

#include "annotated.h"
#include "tile.h"

shared uint sh_row_count[BACKDROP_WG];
shared Alloc sh_row_alloc[BACKDROP_WG];
shared uint sh_row_width[BACKDROP_WG];

void main() {
    if (mem_error != NO_ERROR) {
        return;
    }

    uint th_ix = gl_LocalInvocationID.x;
    uint element_ix = gl_GlobalInvocationID.x;
    AnnotatedRef ref = AnnotatedRef(conf.anno_alloc.offset + element_ix * Annotated_size);

    // Work assignment: 1 thread : 1 path element
    uint row_count = 0;
    if (element_ix < conf.n_elements) {
        AnnotatedTag tag = Annotated_tag(conf.anno_alloc, ref);
        switch (tag.tag) {
        case Annotated_Color:
            if (fill_mode_from_flags(tag.flags) != MODE_NONZERO) {
                break;
            }
            // Fall through.
        case Annotated_FillImage:
        case Annotated_BeginClip:
            PathRef path_ref = PathRef(conf.tile_alloc.offset + element_ix * Path_size);
            Path path = Path_read(conf.tile_alloc, path_ref);
            sh_row_width[th_ix] = path.bbox.z - path.bbox.x;
            row_count = path.bbox.w - path.bbox.y;
            // Paths that don't cross tile top edges don't have backdrops.
            // Don't apply the optimization to paths that may cross the y = 0
            // top edge but have been clipped to 1 row.
            if (row_count == 1 && path.bbox.y > 0) {
                // Note: this can probably be expanded to width = 2 as
                // long as it doesn't cross the left edge.
                row_count = 0;
            }
            Alloc path_alloc = new_alloc(path.tiles.offset, (path.bbox.z - path.bbox.x) * (path.bbox.w - path.bbox.y) * Tile_size);
            sh_row_alloc[th_ix] = path_alloc;
        }
    }

    sh_row_count[th_ix] = row_count;
    // Inclusive prefix sum of sh_row_count across the workgroup
    for (uint i = 0; i < LG_BACKDROP_WG; i++) {
        barrier();
        if (th_ix >= (1 << i)) {
            row_count += sh_row_count[th_ix - (1 << i)];
        }
        barrier();
        sh_row_count[th_ix] = row_count;
    }
    barrier();
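    // Worked example (illustrative only, assuming a hypothetical workgroup of
    // 4 threads whose elements span [3, 0, 2, 1] rows):
    //   initial:   sh_row_count = [3, 0, 2, 1]
    //   after i=0: sh_row_count = [3, 3, 2, 3]  (each thread adds the entry 1 to its left)
    //   after i=1: sh_row_count = [3, 3, 5, 6]  (each thread adds the entry 2 to its left)
    // Entry k then holds the number of rows in elements 0..k, so the last
    // entry is the total row count for the workgroup.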
    // Work assignment: 1 thread : 1 path element row
    uint total_rows = sh_row_count[BACKDROP_WG - 1];
    for (uint row = th_ix; row < total_rows; row += BACKDROP_WG) {
        // Binary search over the prefix sums to find the element owning this row
        uint el_ix = 0;
        for (uint i = 0; i < LG_BACKDROP_WG; i++) {
            uint probe = el_ix + ((BACKDROP_WG / 2) >> i);
            if (row >= sh_row_count[probe - 1]) {
                el_ix = probe;
            }
        }
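        // Worked example (illustrative only, assuming a hypothetical
        // BACKDROP_WG of 4 and sh_row_count = [3, 3, 5, 6]):
        //   row = 4: probe = 2, 4 >= sh_row_count[1] = 3, so el_ix = 2;
        //            probe = 3, 4 <  sh_row_count[2] = 5, so el_ix stays 2.
        // Element 2 owns rows 3 and 4. Element 1 contributes zero rows and is
        // never selected.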
        uint width = sh_row_width[el_ix];
        if (width > 0) {
            // Process one row sequentially
            // Read backdrop value per tile and prefix sum it
            Alloc tiles_alloc = sh_row_alloc[el_ix];
            uint seq_ix = row - (el_ix > 0 ? sh_row_count[el_ix - 1] : 0);
            uint tile_el_ix = (tiles_alloc.offset >> 2) + 1 + seq_ix * 2 * width;
            uint sum = read_mem(tiles_alloc, tile_el_ix);
            for (uint x = 1; x < width; x++) {
                tile_el_ix += 2;
                sum += read_mem(tiles_alloc, tile_el_ix);
                write_mem(tiles_alloc, tile_el_ix, sum);
            }
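            // Worked example (illustrative only, assuming a row of 4 tiles
            // with backdrops [1, 0, -1, 0]): after the loop the row holds
            // [1, 1, 0, 0]. The first tile is read but never rewritten.
            // Negative backdrops are assumed to be stored as two's complement,
            // so the wrapping uint addition gives the signed result.
            // The index arithmetic implies 2 words per tile with the backdrop
            // in the second word (hence the "+ 1" and the stride of 2).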
        }
    }
}