Merge all static and dynamic buffers into a single buffer, "memory". Add a
malloc function for dynamic allocations.
Unify static allocation offsets into a "config" buffer containing scene setup
(number of paths, number of path segments), as well as the memory offsets of
the static allocations.
Finally, set an overflow flag when an allocation fails, and make sure to exit
shader execution as soon as that triggers. Also check the flag before beginning
execution, in case the client wants to run two or more shaders before checking
the flag.
The "state" buffer is left alone because it needs zero'ing and because it is
accessed with the "volatile" keyword.
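A minimal sketch of the scheme, assuming a word-addressed memory buffer; the
names (mem_offset, mem_error, MALLOC_FAILED) are assumptions, not necessarily
the patch's:

    #define MALLOC_FAILED 0xffffffffu

    layout(set = 0, binding = 0) buffer Memory {
        uint mem_offset; // bump-allocation cursor, in words
        uint mem_error;  // overflow flag, read back by the client
        uint memory[];
    };

    uint malloc(uint size) {
        uint offset = atomicAdd(mem_offset, size);
        if (offset + size > uint(memory.length())) {
            atomicMax(mem_error, 1u); // flag the overflow for the client
            return MALLOC_FAILED;
        }
        return offset;
    }

    void main() {
        // bail out early in case a previously run shader overflowed
        if (mem_error != 0u) {
            return;
        }
        // ... shader body ...
    }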
Fixes #40
Signed-off-by: Elias Naur <mail@eliasnaur.com>
The previous attempt to fix inconsistent intersections caused by floating
point inaccuracy [0] missed two cases.
The first case is that top intersections with the very first row would fail
the test
tag == PathSeg_FillCubic && y > y0 && xbackdrop < bbox.z
In particular, y is not larger than y0 when y0 has been clipped to 0.
Fix that by re-introducing the min(p0.y, p1.y) < tile_y0 check, which does work
and is just as consistent. Add a similar check, min(p0.x, p1.x) < tile_x0, for
deciding when to clip the segment to the left edge (but keep the consistent
xray check for deciding left edge *intersections*).
The second case is that tracking left intersections in the [xray, next_xray]
range of tiles may fail when next_xray is forced to last_xray, the final xray
value.
Fix that case by computing next_xray explicitly, before looping over the
x tiles. The code is now much simpler.
Finally, ensure that xx0 and xx1 don't overflow the allocated number of tiles
by clamping them *after* setting them. Adjust xx0 to include xray, just as xx1
is adjusted; I haven't seen corruption without it, but it's not obvious that
xx0 always includes xray.
While here, replace a "+=" on a guaranteed-zero value with a plain "=".
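A sketch of the resulting range computation; tile_range and its parameter
names are illustrative, not the patch's code:

    // xray and next_xray are the ray columns at the top of this row and
    // the next; x_tile_first/x_tile_last span the segment's own tiles
    ivec2 tile_range(int x_tile_first, int x_tile_last,
                     int xray, int next_xray, int n_tile_x) {
        int xx0 = min(x_tile_first, min(xray, next_xray));
        int xx1 = max(x_tile_last, max(xray, next_xray)) + 1;
        // clamp *after* setting, so the range can never overflow the
        // allocated number of tiles
        xx0 = clamp(xx0, 0, n_tile_x);
        xx1 = clamp(xx1, xx0, n_tile_x);
        return ivec2(xx0, xx1);
    }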
Updates #23
[0] 29cfb8b63e
Signed-off-by: Elias Naur <mail@eliasnaur.com>
The state header is only one word (flags), not two.
Move the partition atomic counter to a separate field instead of state[0],
simplifying state offset calculations.
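Sketched, the resulting buffer layout (the binding number is an assumption):

    layout(set = 0, binding = 2) volatile buffer StateBuf {
        uint part_counter; // partition atomic counter, no longer state[0]
        uint[] state;      // one flags word, then the per-partition records
    };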
Signed-off-by: Elias Naur <mail@eliasnaur.com>
The finite precision of floating point computations can lead the coarse
renderer into inconsistent tile intersections, which imply impossible line
segments such as lines with gaps or double intersections. The winding number
algorithm is sensitive to these errors, which show up as incorrectly filled
paths.
This change forces all intersections to be consistent.
First, the floating point top edge intersection test is removed; top edge intersections are
completely determined by left edge intersections.
Then, left edge intersections are inserted from the tile with the last top edge
intersection. The next top edge is then fixed to be the last tile with a left
edge intersection.
More details in the patch comments.
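A sketch of the idea; the helper, SX and TILE_H (16-pixel tiles) are assumed,
while xray and next_xray are names from the patch:

    #define TILE_H 16.0
    #define SX (1.0 / 16.0)

    // tile column occupied by the segment where it crosses a horizontal line
    int tile_col_at(vec2 p0, vec2 p1, float y) {
        float t = (y - p0.y) / (p1.y - p0.y);
        return int(floor(mix(p0.x, p1.x, t) * SX));
    }

    void emit_left_intersections(vec2 p0, vec2 p1, float row_y0) {
        int xray = tile_col_at(p0, p1, row_y0);               // top of this row
        int next_xray = tile_col_at(p0, p1, row_y0 + TILE_H); // next top edge
        // left edge intersections are exactly the columns between the two
        // ray positions, so top and left intersections can never disagree
        for (int x = min(xray, next_xray); x < max(xray, next_xray); x++) {
            // record a left edge intersection between columns x and x + 1
        }
    }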
Fixes #23
Signed-off-by: Elias Naur <mail@eliasnaur.com>
Eliminates the precision loss of the subtraction in the sign(end.x - start.x)
expression in kernel4. That's important for the next change that avoids
inconsistent line intersections in path_coarse.
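A contrived illustration of the hazard (the values are made up): nearby
coordinates collapse under float rounding, so the sign degenerates to zero and
the winding direction is lost:

    float sign_demo() {
        float start_x = 1000000.01; // rounds to 1000000.0 as a 32-bit float
        float end_x   = 1000000.02; // also rounds to 1000000.0
        return sign(end_x - start_x); // 0.0, though end_x > start_x exactly
    }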
Updates #23
Signed-off-by: Elias Naur <mail@eliasnaur.com>
Provide images to fine rasterization kernel as readonly textures with a
sampler, rather than storage images. That lets us use the GPU's hardware
for sampling, which should be considerably more efficient.
There are a bunch of parameters that are hardcoded, but it does seem to
work.
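Shader-side, the switch is roughly from imageLoad on a storage image to
textureLod through a sampler; the binding number and names are assumptions:

    // before: storage image, manual access
    // layout(rgba8, set = 0, binding = 3) uniform readonly image2D image_atlas;
    // vec4 fg = imageLoad(image_atlas, ivec2(xy));

    // after: sampled texture, hardware filtering
    layout(set = 0, binding = 3) uniform sampler2D image_atlas;

    vec4 fill_image(vec2 uv) {
        return textureLod(image_atlas, uv, 0.0);
    }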
This patch passes a dynamically sized array of textures to the fine
rasterizer.
A bunch of the low level Vulkan stuff is done, but only enough of the
shaders and encoders to do minimal testing. We'll want to switch from
storage images to sampled images, track the actual array of textures
during encoding, use that to build the descriptor set (which will need
to be more dynamic), and of course run image elements through the
pipeline.
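Shader-side, a runtime-sized image array needs descriptor indexing; a sketch
with an assumed binding and names:

    #extension GL_EXT_nonuniform_qualifier : enable

    layout(rgba8, set = 0, binding = 4) uniform readonly image2D images[];

    vec4 load_image(uint ix, ivec2 xy) {
        // the image index can vary per pixel, hence nonuniformEXT
        return imageLoad(images[nonuniformEXT(ix)], xy);
    }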
Progress towards #38
We keep a small window of the clip stack in registers in the fine
rasterization kernel, and when that window is exceeded, spill to global
memory, so the clip stack can be unbounded.
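A sketch of the windowed stack as a ring buffer of registers; the window size,
buffer layout and names are assumptions:

    #define CLIP_WINDOW 4 // in-register window size

    layout(set = 0, binding = 6) buffer ClipSpill { uint clip_spill[]; };

    vec4 clip_window[CLIP_WINDOW]; // per-thread registers
    uint clip_depth = 0u;
    uint spill_base = 0u; // per-tile base offset into clip_spill

    void clip_push(vec4 rgba) {
        if (clip_depth >= CLIP_WINDOW) {
            // window full: spill the entry this push will overwrite
            clip_spill[spill_base + clip_depth - CLIP_WINDOW] =
                packUnorm4x8(clip_window[clip_depth % CLIP_WINDOW]);
        }
        clip_window[clip_depth % CLIP_WINDOW] = rgba;
        clip_depth++;
    }

    vec4 clip_pop() {
        clip_depth--;
        vec4 top = clip_window[clip_depth % CLIP_WINDOW];
        if (clip_depth >= CLIP_WINDOW) {
            // reload the entry that re-enters the window
            clip_window[clip_depth % CLIP_WINDOW] =
                unpackUnorm4x8(clip_spill[spill_base + clip_depth - CLIP_WINDOW]);
        }
        return top;
    }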
I realized there's a problem with encoding clip bboxes relative to the
current transform (see #36 for a more detailed explanation), so this changes
them to absolute bboxes.
This more or less gets clips working. There are optimization
opportunities (all-clear and all-opaque mask tiles), and it doesn't deal
with overflow of the blend stack, but it seems to basically work.
Actually handle transforms in RenderCtx (they were implemented in the renderer
but not actually plumbed through). This requires maintaining a state stack,
which will also be needed for clipping.
This PR also starts work on encoding clipping, including tracking
bounding boxes.
WIP, none of this is tested yet.
The Vulkan and OpenGL specifications offer only weak forward progress guarantees, and
in practice several mobile devices fail to complete the decoupled lookback
spinloop without mitigation.
This patch implements Raph's suggestion from the "Forward Progress"
section from
https://raphlinus.github.io/gpu/2020/04/30/prefix-sum.html
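A sketch of the shape of that mitigation: bound the spin, then fall back to
recomputing the missing aggregate from the input, so progress never depends on
another workgroup being scheduled. The constants, flag encoding and names are
assumptions:

    #define MAX_SPIN 1024u        // polls before giving up
    #define FLAG_PREFIX_READY 2u

    layout(set = 0, binding = 2) volatile buffer StateBuf { uint state[]; };

    bool wait_for_prefix(uint flag_word_ix) {
        for (uint i = 0u; i < MAX_SPIN; i++) {
            if (state[flag_word_ix] == FLAG_PREFIX_READY) {
                return true; // the predecessor published its prefix
            }
        }
        // give up; the caller recomputes the predecessor's aggregate from
        // the raw input instead of waiting any longer
        return false;
    }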
Signed-off-by: Elias Naur <mail@eliasnaur.com>
Expand the final kernel4 stage to maintain a per-pixel mask.
Introduce two new path elements, FillMask and FillMaskInv, to fill
the mask. FillMask acts like Fill, while FillMaskInv fills the area
outside the path.
SVG clipPaths are then representable by a FillMaskInv(0.0) for every nested
path, preceded by a FillMask(1.0) to clear the mask.
The bounding box for FillMaskInv elements is the entire screen; tightening of
the bounding box is left for future work. Note that a fullscreen bounding
box is not hopelessly inefficient because completely filling a tile with
a mask is just a single CmdSolidMask per tile.
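A sketch of the two mask updates, assuming area is the path coverage in
[0, 1] and value comes from the element; the helper names are illustrative:

    float apply_fill_mask(float mask, float value, float area) {
        // FillMask: write the value where the path covers the pixel
        return mix(mask, value, area);
    }

    float apply_fill_mask_inv(float mask, float value, float area) {
        // FillMaskInv: write the value where the path does not cover it
        return mix(value, mask, area);
    }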
Fixes #30
Signed-off-by: Elias Naur <mail@eliasnaur.com>
Both the Vulkan and OpenGL ES specs allow implementations to limit workgroups to
128 threads. Add a LG_WG_FACTOR setting for easy switching between 128 and 256
threads, with 256 being kept as the default setting.
Manually tested that LG_WG_FACTOR = 0 (128 threads) works as expected.
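Sketched; LG_WG_FACTOR is the setting from the patch, while the derived macros
and layout are assumptions:

    #define LG_WG_FACTOR 1 // 0 => 128 threads, 1 => 256 threads
    #define LG_WG_SIZE (7 + LG_WG_FACTOR)
    #define WG_SIZE (1 << LG_WG_SIZE)

    layout(local_size_x = WG_SIZE, local_size_y = 1) in;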
Signed-off-by: Elias Naur <mail@eliasnaur.com>
The transformation determinant is signed, but we're only interested in
the absolute scale for transforming linewidths.
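Sketched, assuming the transform is stored as a vec4 (a, b, c, d):

    float transform_linewidth(vec4 mat, float linewidth) {
        // the determinant's sign encodes orientation; only the absolute
        // scale matters for stroke widths
        return linewidth * sqrt(abs(mat.x * mat.w - mat.y * mat.z));
    }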
Signed-off-by: Elias Naur <mail@eliasnaur.com>
This is a bit of a revert of the load-balanced ("more parallel") coarse
path rasterizer, but includes fills and also uses atomicExchange.
I'm doing it this way because it should be considerably easier to do
flattening in this structure, even though there will be some performance
regression.
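A sketch of the atomicExchange pattern for building per-tile segment lists;
the buffer layout and names are assumptions:

    layout(set = 0, binding = 2) buffer Tiles { uint tile_head[]; };
    layout(set = 0, binding = 3) buffer Segments { uint seg_next[]; };

    void push_segment(uint tile_ix, uint seg_ix) {
        // atomically prepend this segment to the tile's linked list;
        // readers run in a later dispatch, after a pipeline barrier
        uint old_head = atomicExchange(tile_head[tile_ix], seg_ix);
        seg_next[seg_ix] = old_head;
    }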