Provide images to the fine rasterization kernel as read-only textures
with a sampler, rather than as storage images. That lets us use the
GPU's hardware for sampling, which should be considerably more
efficient.
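As a sketch of the shape of this (the binding index and names here are
illustrative, not the actual ones):

```glsl
// Combined image sampler in place of a storage image; texture() goes
// through the fixed-function sampling hardware.
layout(set = 0, binding = 3) uniform sampler2D image_atlas;

vec4 sample_image(vec2 uv) {
    return texture(image_atlas, uv);
}
```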
There are a bunch of parameters that are hardcoded, but it does seem to
work.
This patch passes a dynamically sized array of textures to the fine
rasterizer.
A bunch of the low-level Vulkan work is done, but only enough of the
shaders and encoders to do minimal testing. We'll want to switch from
storage images to sampled images, track the actual array of textures
during encoding, use that to build the descriptor set (which will need
to be more dynamic), and of course run image elements through the
pipeline.
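A minimal sketch of the shader side as it stands, assuming
GL_EXT_nonuniform_qualifier for the runtime-sized descriptor array
(the binding and names are illustrative):

```glsl
#extension GL_EXT_nonuniform_qualifier : enable

// Unsized array of storage images; the descriptor set supplies the
// actual count at bind time.
layout(rgba8, set = 0, binding = 3) uniform readonly image2D images[];

vec4 read_image(uint image_ix, ivec2 xy) {
    // nonuniformEXT because image_ix can vary across a subgroup
    return imageLoad(images[nonuniformEXT(image_ix)], xy);
}
```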
Progress towards #38
We keep a small window of the clip stack in registers in the fine
rasterization kernel, and when that window is exceeded, spill to global
memory, so the clip stack can be unbounded.
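Schematically, the window behaves like a ring buffer over a small
register array; the window size, binding, and names below are invented
for illustration:

```glsl
#define CLIP_WINDOW 4

layout(set = 0, binding = 5) buffer ClipSpillBuf {
    vec4 clip_spill[];
};

uint clip_depth = 0;
uint clip_base = 0;            // this thread's slice of the spill buffer
vec4 clip_window[CLIP_WINDOW]; // top of the stack, held in registers

void clip_push(vec4 v) {
    if (clip_depth >= CLIP_WINDOW) {
        // window full: evict the oldest register entry to global memory
        clip_spill[clip_base + clip_depth - CLIP_WINDOW] =
            clip_window[clip_depth % CLIP_WINDOW];
    }
    clip_window[clip_depth % CLIP_WINDOW] = v;
    clip_depth++;
}

vec4 clip_pop() {
    clip_depth--;
    vec4 top = clip_window[clip_depth % CLIP_WINDOW];
    if (clip_depth >= CLIP_WINDOW) {
        // refill the freed slot from the spill area
        clip_window[clip_depth % CLIP_WINDOW] =
            clip_spill[clip_base + clip_depth - CLIP_WINDOW];
    }
    return top;
}
```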
I realized there's a problem with encoding clip bboxes relative to the
current transform (see #36 for a more detailed explanation), so this
patch changes the encoding to absolute bboxes.
This more or less gets clips working. There are optimization
opportunities (all-clear and all-opaque mask tiles), and it doesn't deal
with overflow of the blend stack, but it seems to basically work.
Actually handle transforms in RenderCtx (they were implemented in the
renderer but not actually plumbed through). This also requires
maintaining a state stack, which will also be required for clipping.
This PR also starts work on encoding clipping, including tracking
bounding boxes.
WIP, none of this is tested yet.
The hub does a little better lifetime tracking of resources (so
Rust-side references can be dropped), and in the future will be used for
dynamic selection of backend.
The migration is still a bit half-baked, as there are a bunch of
Vulkan-specific types in the signatures, but it shouldn't be too much
work to sort that out. Perhaps it can wait until there is a second
backend though.
The main motivation for this is to create image objects with lifetime
tracking, one of the things required for #38.
Update to the latest versions of all dependencies. Among other things,
this gets us on piet 0.2, though almost all of the changes there were
around text, which is not yet implemented.
The Vulkan and OpenGL specifications offer only weak forward progress guarantees, and
in practice several mobile devices fail to complete the decoupled lookback
spinloop without mitigation.
This patch implements Raph's suggestion from the "Forward Progress"
section of
https://raphlinus.github.io/gpu/2020/04/30/prefix-sum.html
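The shape of the mitigation, as a hedged sketch (the flag encoding,
names, and spin limit are all illustrative): bound the spin, and on
timeout have the workgroup recompute the missing contribution itself
rather than depend on another workgroup being scheduled.

```glsl
#define SPIN_LIMIT 1024
#define FLAG_PREFIX_READY 2u

layout(set = 0, binding = 1) volatile buffer StateBuf {
    uint flag[];
};

// Returns true if the predecessor partition published its prefix in
// time; on false, the caller falls back to recomputing that
// partition's contribution from the raw input.
bool wait_for_prefix(uint part_ix) {
    for (uint spins = 0; spins < SPIN_LIMIT; spins++) {
        if (flag[part_ix] == FLAG_PREFIX_READY) {
            return true;
        }
    }
    return false;
}
```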
Signed-off-by: Elias Naur <mail@eliasnaur.com>
Expand the final kernel4 stage to maintain a per-pixel mask.
Introduce two new path elements, FillMask and FillMaskInv, to fill
the mask. FillMask acts like Fill, while FillMaskInv fills the area
outside the path.
An SVG clipPath is then representable by a FillMaskInv(0.0) for every
nested path, preceded by a FillMask(1.0) to clear the mask.
The bounding box for FillMaskInv elements is the entire screen; tightening of
the bounding box is left for future work. Note that a fullscreen bounding
box is not hopelessly inefficient because completely filling a tile with
a mask is just a single CmdSolidMask per tile.
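A sketch of one plausible reading of the kernel4 side (names invented;
`area` is the usual per-pixel path coverage):

```glsl
// Per-pixel mask maintained alongside the color accumulator.
float mask = 1.0;

// CmdFillMask: pixels inside the path move toward the mask value.
void fill_mask(float area, float value) {
    mask = mix(mask, value, area);
}

// CmdFillMaskInv: pixels outside the path move toward the mask value.
void fill_mask_inv(float area, float value) {
    mask = mix(mask, value, 1.0 - area);
}
```

Subsequent fill and stroke commands then have their alpha modulated by
the mask.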
Fixes #30
Signed-off-by: Elias Naur <mail@eliasnaur.com>
Both the Vulkan and OpenGL ES specs allow implementations to limit
workgroups to 128 threads. Add an LG_WG_FACTOR setting for easy
switching between 128 and 256 threads, with 256 kept as the default.
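A sketch of what this looks like in the shared shader header (the
exact macro layering may differ):

```glsl
// 0 selects 128-thread workgroups, 1 selects 256.
#define LG_WG_FACTOR 1
#define LG_WG_SIZE (7 + LG_WG_FACTOR)
#define WG_SIZE (1 << LG_WG_SIZE)

layout(local_size_x = WG_SIZE, local_size_y = 1) in;
```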
Manually tested that LG_WG_FACTOR = 0 (128 threads) works as expected.
Signed-off-by: Elias Naur <mail@eliasnaur.com>
The transformation determinant is signed, but we're only interested in
the absolute scale for transforming linewidths.
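Concretely, with the 2x2 part of the transform packed as a vec4
(a, b, c, d):

```glsl
// abs() because a reflection has a negative determinant but the same
// absolute scale.
float linewidth_scale(vec4 mat) {
    return sqrt(abs(mat.x * mat.w - mat.y * mat.z));
}
```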
Signed-off-by: Elias Naur <mail@eliasnaur.com>
This is a bit of a revert of the load-balanced ("more parallel") coarse
path rasterizer, but includes fills and also uses atomicExchange.
I'm doing it this way because it should be considerably easier to do
flattening in this structure, even though there will be some performance
regression.
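The atomicExchange part, schematically (struct layout and names are
illustrative): each freshly allocated segment chunk is pushed onto a
per-tile linked list with a single atomic.

```glsl
struct Tile { uint segment_head; };
struct Chunk { uint next; /* segment payload elided */ };

layout(set = 0, binding = 2) buffer TileBuf { Tile tiles[]; };
layout(set = 0, binding = 3) buffer ChunkBuf { Chunk chunks[]; };

void push_chunk(uint tile_ix, uint chunk_ref) {
    // One atomic swaps in the new head; the old head becomes the tail.
    uint old_head = atomicExchange(tiles[tile_ix].segment_head, chunk_ref);
    chunks[chunk_ref].next = old_head;
}
```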
Have a more-parallel read of the tile structures based on bbox coverage,
and only set the bit when the tile isn't empty.
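Roughly, the cooperative read looks like this (names and the emptiness
test are illustrative, and note_tile is a hypothetical helper standing
in for the shared-bitmap write):

```glsl
struct Tile { uint segment_head; int backdrop; };
layout(set = 0, binding = 2) readonly buffer TileBuf { Tile tiles[]; };

void note_tile(uint x, uint y); // hypothetical: sets the tile's bit

// Threads stride over the element's bbox together; only non-empty
// tiles get their bit set.
void cover_bbox(uint th_ix, uint n_threads, uvec4 bbox, uint tiles_per_row) {
    uint w = bbox.z - bbox.x;
    uint n_tiles = w * (bbox.w - bbox.y);
    for (uint i = th_ix; i < n_tiles; i += n_threads) {
        uint x = bbox.x + i % w;
        uint y = bbox.y + i / w;
        Tile tile = tiles[y * tiles_per_row + x];
        if (tile.segment_head != 0 || tile.backdrop != 0) {
            note_tile(x, y);
        }
    }
}
```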
This is a speedup, but there is some duplicated work and it is possible
to improve it further.
Path segments are unsorted, but other elements are using the same
sort-middle approach as before.
This is a checkpoint. At this point, there are unoptimized versions
of tile init and coarse path raster, but they aren't wired up into a
working pipeline. I'm also observing about a 3x performance regression
in element processing, which needs to be investigated.
In kernel 4, compute a chunk of pixels rather than just one per thread.
This is a dramatic speedup.
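The gist, with illustrative names and chunk size:

```glsl
#define CHUNK 8

// Each thread owns CHUNK pixels (a small column of the tile), so the
// per-command decode cost is amortized over the whole chunk.
vec4 rgba[CHUNK];

void solid_fill(vec4 fg) {
    for (int i = 0; i < CHUNK; i++) {
        // premultiplied-alpha "over"
        rgba[i] = fg + rgba[i] * (1.0 - fg.a);
    }
}
```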
(This commit cherry-picked from another working branch)
Another speedup might be to special-case when the number of chunks in
a stroke or fill command is 1; then the segment header wouldn't need
allocation and memory traffic would be reduced. But for now we'll
avoid the complexity.
Coarse rasterization wasn't entirely taking line width into account.
Also fix a swizzle in the matrix code (not yet used), and add the
missing End command in the ptcl output (this hasn't been a problem
because the buffer was cleared).
Trying to fit the right_edge computation into the fancy monad doesn't
really work, so use a more straightforward approach and compute it
from the aggregate.
Also add yEdge logic (basically copying piet-metal). With a fix to
ELEMENT_BINNING_RATIO (which I had simply gotten wrong), the example
renders almost correctly, with small bounding box artifacts.
Write the right_edge to the binning output.
More work on encoding the fill/stroke distinction and plumbing it
through the pipeline. This is a bit unsatisfying because of the code
duplication; having an extra fill/stroke bool might be better, but I
want to avoid making the structs bigger (this could be solved by
better packing in the struct encoding).
Fills are plumbed through to the last stage. Backdrop is WIP.
This should get the "right_edge" value for each segment plumbed through
to the binning phase. It also needs to be plumbed to coarse raster and
wired up there.
Also consider this WIP, as none of the logic has been tested yet.
Compute tile coverage of segments using optimized algorithm. This
algorithm does a bit of setup, then uses an efficient formula to
compute the span per scan-line.
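The gist, as a simplified sketch in tile coordinates (setup once per
segment, then O(1) per scan-line; degenerate cases are glossed over):

```glsl
// Each row's covered x-span is a closed-form range around the
// segment's crossing of the row's centerline, widened to account for
// the slope within the row and for half the line width.
void tile_spans(vec2 p0, vec2 p1, float halfwidth, int y0, int y1) {
    float dx = p1.x - p0.x;
    float dy = p1.y - p0.y;
    float idxdy = abs(dy) < 1e-9 ? 1e9 : dx / dy; // inverse slope
    float c = 0.5 * abs(idxdy) + halfwidth * sqrt(1.0 + idxdy * idxdy);
    float xc = p0.x + (float(y0) + 0.5 - p0.y) * idxdy;
    for (int y = y0; y < y1; y++) {
        int xmin = int(floor(xc - c));
        int xmax = int(ceil(xc + c));
        // record coverage of tiles [xmin, xmax) in row y (elided)
        xc += idxdy;
    }
}
```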
As of this point, it mostly renders stroke outlines for the tiger.
Some dropouts are because the scan in the elements pass doesn't do
lookback yet; others are probably a bug.
This version seems to work, but the allocation of segments has low
utilization. It's probably best to allocate in chunks rather than try
to make them contiguous.
This adds the first step of polyline stroking: encoding it into the
scene. It also does a bit of cleanup, collecting the dimensions into
one header file.
This patch adds a module that contains both scene and ptcl types (very
lightly adapted from piet-metal), as well as infrastructure for encoding
Rust-side.
WIP; it's not yet wired up either in the shaders or on the Rust side.
Populates the piet-gpu subdir, with an extremely simple renderer. The
main program saves the image to a PNG.
Contains a few fixes (I was confused about the need for multiple
bindings, as opposed to multiple descriptors within a binding).