The Vulkan and OpenGL specifications offer only weak forward progress guarantees, and
in practice several mobile devices fail to complete the decoupled lookback
spinloop without mitigation.
This patch implements Raph's suggestion from the "Forward Progress"
section from
https://raphlinus.github.io/gpu/2020/04/30/prefix-sum.html
Signed-off-by: Elias Naur <mail@eliasnaur.com>
Expand the the final kernel4 stage to maintain a per-pixel mask.
Introduce two new path elements, FillMask and FillMaskInv, to fill
the mask. FillMask acts like Fill, while FillMaskInv fills the area
outside the path.
SVG clipPaths is then representable by a FillMaskInv(0.0) for every nested
path, preceded by a FillMask(1.0) to clear the mask.
The bounding box for FillMaskInv elements is the entire screen; tightening of
the bounding box is left for future work. Note that a fullscreen bounding
box is not hopelessly inefficient because completely filling a tile with
a mask is just a single CmdSolidMask per tile.
Fixes#30
Signed-off-by: Elias Naur <mail@eliasnaur.com>
Both the Vulkan and OpenGL ES spec allow implementations to limit workgroups to
128 threads. Add a LG_WG_FACTOR setting for easy switching between 128 and 256
threads, with 256 being kept as the default setting.
Manually tested that LG_WG_FACTOR = 0 (128 threads) works as expected.
Signed-off-by: Elias Naur <mail@eliasnaur.com>
The transformation determinant is signed, but we're only interested in
the absolute scale for transforming linewidths.
Signed-off-by: Elias Naur <mail@eliasnaur.com>
This is a bit of a revert of the load-balanced ("more parallel") coarse
path rasterizer, but includes fills and also uses atomicExchange.
I'm doing it this way because it should be considerably easier to do
flattening in this structure, even though there will be some performance
regression.
Have a more-parallel read of the tile structures based on bbox coverage,
and only set the bit when the tile isn't empty.
This is a speedup, but there is some duplicated work and it is possible
to improve it further.
Path segments are unsorted, but other elements are using the same
sort-middle approach as before.
This is a checkpoint. At this point, there are unoptimized versions
of tile init and coarse path raster, but it isn't wired up into a
working pipeline. Also observing about a 3x performance regression in
element processing, which needs to be investigated.
In kernel 4, compute a chunk of pixels rather than just one per thread.
This is a dramatic speedup.
(This commit cherry-picked from another working branch)
Another speedup might be to special-case when the number of chunks in a
stroke or fill command is 1, then the segment header doesn't need
allocation and memory traffic is reduced. But right now we'll avoid the
complexity.
Coarse rasterization wasn't entirely taking line width into account.
Also fix swizzle in matrix (not yet used). And fix missing End command
in ptcl output (hasn't been a problem because buffer was cleared).
Trying to fit it into the fancy monad doesn't really work, so use a
more straightforward approach to compute it from the aggregate.
Also add yEdge logic (basically copying piet-metal). With a fix to
ELEMENT_BINNING_RATIO (which I had simply gotten wrong), the example
renders almost correctly, with small bounding box artifacts.
Write the right_edge to the binning output.
More work on encoding the fill/stroke distinction and plumbing that
through the pipeline. This is a bit unsatisfying because of the code
duplication; having an extra fill/stroke bool might be better, but I
want to avoid making the structs bigger (this could be solved by
better packing in the struct encoding).
Fills are plumbed through to the last stage. Backdrop is WIP.
This should get the "right_edge" value for each segment plumbed through
to the binning phase. It also needs to be plumbed to coarse raster and
wired up there.
Also considering WIP because none of this logic has been tested yet.
Compute tile coverage of segments using optimized algorithm. This
algorithm does a bit of setup, then uses an efficient formula to
compute the span per scan-line.
As of this point, it mostly renders stroke outlines for tiger. Some
dropouts are because the scan in the elements pass doesn't do lookback
yet, others are probably a bug.
This version seems to work but the allocation of segments has low
utilization. Probably best to allocate in chunks rather than try to
make them contiguous.
This just adds the first step of polyline stroking, which is adding it
to the scene. Also just a bit of cleaning up of dimensions into one
header file.
This patch adds a module that contains both scene and ptcl types (very
lightly adapted from piet-metal), as well as infrastructure for encoding
Rust-side.
WIP, it's not wired up in either the shader or on the Rust side.
Populates the piet-gpu subdir, with an extremely simple renderer. The
main program saves the image to a PNG.
Contains a few fixes (I was confused about the need for multiple
bindings, as opposed to multiple descriptors within a binding).