Separate out render context upload from renderer creation. Upload ramps
to GPU buffer. Encode gradients to scene description. Fix a number of
bugs in uploading and processing.
This renders gradients in a test image, but has some shortcomings. For
one, staging buffers need to be applied for a couple things (they're
just host mapped for now). Also, the interaction between sRGB and
premultiplied alpha isn't quite right. The size of the gradient ramp
buffer is fixed and should be dynamic.
And of course there's always more optimization to be done, including
making the upload of gradient ramps more incremental, and probably
hashing of the stops instead of the processed ramps.
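As a rough illustration of the stop-hashing idea, here is a minimal Rust sketch, not the code in this change. The names `Stop`, `N_SAMPLES`, `make_ramp` and `RampCache` are hypothetical, and the sketch ignores sRGB and premultiplication entirely. The cache is keyed on the bit patterns of the stops, so a repeated gradient skips ramp generation altogether.

```rust
use std::collections::HashMap;

/// Number of samples per ramp (hypothetical value).
const N_SAMPLES: usize = 512;

/// A color stop: offset in [0, 1] and an RGBA color.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Stop {
    offset: f32,
    rgba: [f32; 4],
}

/// Linearly interpolate the stops into a fixed-size ramp.
/// Assumes `stops` is non-empty and sorted by offset.
fn make_ramp(stops: &[Stop]) -> Vec<[f32; 4]> {
    let mut ramp = Vec::with_capacity(N_SAMPLES);
    let mut i = 0; // index of the last stop at or before t
    for s in 0..N_SAMPLES {
        let t = s as f32 / (N_SAMPLES - 1) as f32;
        while i + 1 < stops.len() && stops[i + 1].offset <= t {
            i += 1;
        }
        let a = stops[i];
        let color = if t <= a.offset || i + 1 == stops.len() {
            a.rgba
        } else {
            let b = stops[i + 1];
            let u = (t - a.offset) / (b.offset - a.offset);
            let mut c = [0.0f32; 4];
            for k in 0..4 {
                c[k] = a.rgba[k] + u * (b.rgba[k] - a.rgba[k]);
            }
            c
        };
        ramp.push(color);
    }
    ramp
}

/// Ramp cache keyed on the raw stops rather than on the processed ramp.
struct RampCache {
    map: HashMap<Vec<u32>, usize>,
    ramps: Vec<Vec<[f32; 4]>>,
}

impl RampCache {
    fn new() -> Self {
        RampCache { map: HashMap::new(), ramps: Vec::new() }
    }

    /// Return the ramp index for `stops`, generating the ramp only on a miss.
    fn add(&mut self, stops: &[Stop]) -> usize {
        let mut key = Vec::with_capacity(stops.len() * 5);
        for s in stops {
            key.push(s.offset.to_bits());
            key.extend(s.rgba.iter().map(|c| c.to_bits()));
        }
        if let Some(&ix) = self.map.get(&key) {
            return ix;
        }
        let ix = self.ramps.len();
        self.ramps.push(make_ramp(stops));
        self.map.insert(key, ix);
        ix
    }
}
```

Keying on the stops keeps the hashed data small (five words per stop instead of a full ramp), at the cost of treating bitwise-different but visually identical stop lists as distinct.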
WIP. Most of the GPU-side work should be done (though it's not tested
end-to-end and it's certainly possible I missed something), but still
needs work on encoding side.
The compute shaders have a check for the successful completion of their
preceding stage. However, consider a shader execution path like the
following:
    void main() {
        if (mem_error != NO_ERROR) {
            return;
        }
        ...
        malloc(...);
        ...
        barrier();
        ...
    }
and a shader execution that fails to allocate memory, thereby setting
mem_error to ERR_MALLOC_FAILED in malloc before reaching the barrier. If
another shader execution then begins execution, its mem_error check will
make it return early and not reach the barrier.
All GPU APIs require (dynamically) uniform control flow for barriers,
and the above case may lead to GPU hangs in practice.
Fix this issue by replacing the early exits with careful checks that
don't interrupt barrier control flow.
Unfortunately, it's harder to prove the soundness of the new checks, so
this change also clears dynamic memory ranges in MEM_DEBUG mode when
memory is exhausted. The result is that accessing memory after
exhaustion triggers an error.
Signed-off-by: Elias Naur <mail@eliasnaur.com>
The BeginClip and EndClip bounding boxes are absolute and must pairwise
match. I mistakenly modified the BeginClip bounding box for stroked
clips.
Signed-off-by: Elias Naur <mail@eliasnaur.com>
This change completes general support for stroked fills for clips and
images.
Annotated_size increases from 28 to 32, because of the linewidth field
added to AnnoImage. Stroked image fills are presumably rare, and if
memory pressure turns out to be a bottleneck, we could replace the
linewidth field with separate AnnoLinewidth elements.
Updates #70
Signed-off-by: Elias Naur <mail@eliasnaur.com>
Encode stroke vs fill as tag flags, thereby reducing the number of scene
elements. Encoding change only, no functional changes.
The previous Stroke and Fill commands are merged to one command,
FillColor. The encoding to annotated element is divergent, which is
fixed when annotated elements move to tag flags.
Updates #70
Signed-off-by: Elias Naur <mail@eliasnaur.com>
Commit 9afa9b86b6 added Rust support for
encoding flags into elements. This change adds support to shaders by
introducing variant tag structs:
    struct VariantTag {
        uint tag;
        uint flags;
    };
and returning them from Variant_tag functions.
It also adds a flags argument to write functions for enum variants that
include TagFlags.
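To make the tag/flags split concrete, here is a small Rust sketch of one possible packing. The layout (tag in the low 16 bits, flags in the high 16 bits of one word) and the names `pack_tag`, `unpack_tag`, `TAG_FILL_COLOR` and `FLAG_STROKE` are assumptions for illustration, not taken from this change.

```rust
/// Decoded tag word, mirroring the VariantTag struct above.
#[derive(Clone, Copy, Debug, PartialEq)]
struct VariantTag {
    tag: u32,
    flags: u32,
}

/// Hypothetical tag and flag values, for illustration only.
const TAG_FILL_COLOR: u32 = 1;
const FLAG_STROKE: u32 = 1;

/// Pack a tag and its flags into one word.
/// Assumed layout: tag in bits 0..16, flags in bits 16..32.
fn pack_tag(tag: u32, flags: u32) -> u32 {
    debug_assert!(tag <= 0xffff && flags <= 0xffff);
    (flags << 16) | (tag & 0xffff)
}

/// Split a packed word back into tag and flags.
fn unpack_tag(word: u32) -> VariantTag {
    VariantTag {
        tag: word & 0xffff,
        flags: word >> 16,
    }
}
```

With a layout like this, a stroked and a filled element share the same tag value and differ only in the flags half of the word.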
No functionality changes.
Updates #70
Signed-off-by: Elias Naur <mail@eliasnaur.com>
FillImage is like Fill, except that it takes its color from one or
more image atlases.
kernel4 uses a single image for non-Vulkan hosts, and a dynamically sized array
of image descriptors on Vulkan.
A previous version of this commit used textures. I think images are a better
choice for piet-gpu, for several reasons:
- Texture sampling, in particular textureGrad, is slow on lower spec devices
such as Google Pixel. Texture sampling is particularly slow and difficult to
implement for CPU fallbacks.
- Texture sampling needs more parameters, in particular the full u,v
transformation matrix, leading to a large increase in the command size. Since
all commands use the same size, that memory penalty is paid by all scenes, not
just scenes with textures.
- It is unlikely that piet-gpu will support every kind of fill for every
client, because each kind must be added to kernel4.
With FillImage, a client will prepare the image(s) in separate shader stages,
sampling and applying transformations and special effects as needed. Textures
that align with the output pixel grid can be used directly, without
pre-processing.
Note that the pre-processing step can run concurrently with the piet-gpu pipeline;
only the last stage, kernel4, needs the images.
Pre-processing most likely uses fixed function vertex/fragment programs,
which on some GPUs may run in parallel with piet-gpu's compute programs.
While here, fix a few validation errors:
- Explicitly enable EXT_descriptor_indexing, KHR_maintenance3,
KHR_get_physical_device_properties2.
- Specify a VkDescriptorSetVariableDescriptorCountAllocateInfo for
vkAllocateDescriptorSets. Otherwise, variable image2D arrays won't work (but
sampler2D arrays do, at least on my setup).
Updates #38
Signed-off-by: Elias Naur <mail@eliasnaur.com>
As described in #62, the non-deterministic scene monoid may result in
slightly different transformations for path segments in an otherwise
closed path.
This change ensures consistent transformation across paths in three steps.
First, absolute transformations computed by the scene monoid are stored
along with path segments and annotated elements.
Second, elements.comp no longer transforms path segments. Instead, each
segment is stored untransformed along with a reference to its absolute
transformation.
Finally, path_coarse performs the transformation of path segments.
Because all segments in a path share a single transformation reference,
the inconsistency in #62 is avoided.
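The three steps above can be sketched in Rust roughly as follows. This is a simplified model under assumptions: the names `Transform`, `PathSeg`, `trans_ix` and `transform_seg` are hypothetical, segments are reduced to lines, and the real work happens in the shaders.

```rust
/// 2x2 matrix (column-major) plus translation.
#[derive(Clone, Copy, Debug)]
struct Transform {
    mat: [f32; 4],
    translate: [f32; 2],
}

impl Transform {
    fn apply(&self, p: [f32; 2]) -> [f32; 2] {
        [
            self.mat[0] * p[0] + self.mat[2] * p[1] + self.translate[0],
            self.mat[1] * p[0] + self.mat[3] * p[1] + self.translate[1],
        ]
    }
}

/// A line segment stored untransformed, with a reference to the
/// absolute transformation it should be drawn with.
#[derive(Clone, Copy, Debug)]
struct PathSeg {
    p0: [f32; 2],
    p1: [f32; 2],
    trans_ix: usize,
}

/// Late transformation, as the coarse path stage would do it.
fn transform_seg(seg: &PathSeg, transforms: &[Transform]) -> ([f32; 2], [f32; 2]) {
    let t = &transforms[seg.trans_ix];
    (t.apply(seg.p0), t.apply(seg.p1))
}
```

Because both segments that meet at a shared point look up the same stored transformation, the shared point maps to bit-identical coordinates and the path stays closed.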
Fixes #62
Signed-off-by: Elias Naur <mail@eliasnaur.com>
The NVIDIA shader compiler bug that forced splitting of the state struct
into primitive types is now fixed.
Signed-off-by: Elias Naur <mail@eliasnaur.com>
Defining MEM_DEBUG in mem.h will add a size field to Alloc and enable
bounds and alignment checks for every memory read and write.
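The kind of check this enables can be modelled in Rust as below. This is a hypothetical sketch, not the shader code: the byte-based units, the `touch_mem` and `read_mem` names and the `Option` return are assumptions made for illustration.

```rust
/// An allocation with the extra size field (both in bytes).
#[derive(Clone, Copy, Debug)]
struct Alloc {
    offset: u32,
    size: u32,
}

/// True if a 4-byte access at byte `offset` is word-aligned and lies
/// entirely inside `alloc`.
fn touch_mem(alloc: Alloc, offset: u32) -> bool {
    offset % 4 == 0
        && offset >= alloc.offset
        && alloc.size >= 4
        && offset - alloc.offset <= alloc.size - 4
}

/// Checked read of one word from the backing memory; `None` stands in
/// for whatever the debug build does on a failed check.
fn read_mem(memory: &[u32], alloc: Alloc, offset: u32) -> Option<u32> {
    if !touch_mem(alloc, offset) {
        return None;
    }
    memory.get((offset / 4) as usize).copied()
}
```

Every access carries the Alloc it was derived from, which is why an Alloc derived from a bare reference without a known size cannot be checked soundly.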
Notes:
- Deriving an Alloc from Path.tiles is unsound, but it's more trouble to
convert Path.tiles from TileRef to a variable sized Alloc.
- elements.comp notes that "We should be able to use an array of structs but the
NV shader compiler doesn't seem to like it". If that's still relevant, do
the shared arrays of Allocs work?
Signed-off-by: Elias Naur <mail@eliasnaur.com>
Merge all static and dynamic buffers to just one, "memory". Add a malloc
function for dynamic allocations.
Unify static allocation offsets into a "config" buffer containing scene setup
(number of paths, number of path segments), as well as the memory offsets of
the static allocations.
Finally, set an overflow flag when an allocation fails, and make sure to exit
shader execution as soon as that triggers. Add checks before beginning
execution in case the client wants to run two or more shaders before checking
the flag.
The "state" buffer is left alone because it needs zero'ing and because it is
accessed with the "volatile" keyword.
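A single-threaded Rust model of the allocation scheme might look like the sketch below. It is an illustration under assumptions: the shader version would bump the offset atomically, and the names `Memory`, `MallocResult`, `mem_offset`, `mem_error`, `NO_ERROR` and `ERR_MALLOC_FAILED` are used here as hypothetical stand-ins.

```rust
const NO_ERROR: u32 = 0;
const ERR_MALLOC_FAILED: u32 = 1;

/// Bookkeeping for the single "memory" buffer (sizes in bytes).
struct Memory {
    mem_offset: u32, // next free byte
    mem_error: u32,  // sticky overflow flag
    capacity: u32,   // total size of the buffer
}

/// Result of an allocation attempt.
#[derive(Debug, PartialEq)]
struct MallocResult {
    offset: u32,
    failed: bool,
}

impl Memory {
    fn new(capacity: u32) -> Self {
        Memory { mem_offset: 0, mem_error: NO_ERROR, capacity }
    }

    /// Bump-allocate `size` bytes. On exhaustion, set the overflow flag
    /// and report failure instead of handing out memory.
    fn malloc(&mut self, size: u32) -> MallocResult {
        match self.mem_offset.checked_add(size) {
            Some(end) if end <= self.capacity => {
                let offset = self.mem_offset;
                self.mem_offset = end;
                MallocResult { offset, failed: false }
            }
            _ => {
                self.mem_error = ERR_MALLOC_FAILED;
                MallocResult { offset: 0, failed: true }
            }
        }
    }
}
```

The flag is sticky, so a client that runs several stages back to back only needs to inspect it once, after the last stage.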
Fixes #40
Signed-off-by: Elias Naur <mail@eliasnaur.com>
The state header is only one word (flags), not two.
Move the partition atomic counter to a separate field instead of state[0],
simplifying state offset calculations.
Signed-off-by: Elias Naur <mail@eliasnaur.com>
I realized there's a problem with encoding clip bboxes relative to the
current transform (see #36 for a more detailed explanation), so this is
changing it to absolute bboxes.
This more or less gets clips working. There are optimization
opportunities (all-clear and all-opaque mask tiles), and it doesn't deal
with overflow of the blend stack, but it seems to basically work.
The Vulkan and OpenGL specifications offer only weak forward progress guarantees, and
in practice several mobile devices fail to complete the decoupled lookback
spinloop without mitigation.
This patch implements Raph's suggestion from the "Forward Progress"
section from
https://raphlinus.github.io/gpu/2020/04/30/prefix-sum.html
Signed-off-by: Elias Naur <mail@eliasnaur.com>
Expand the final kernel4 stage to maintain a per-pixel mask.
Introduce two new path elements, FillMask and FillMaskInv, to fill
the mask. FillMask acts like Fill, while FillMaskInv fills the area
outside the path.
SVG clipPaths are then representable by a FillMaskInv(0.0) for every nested
path, preceded by a FillMask(1.0) to clear the mask.
The bounding box for FillMaskInv elements is the entire screen; tightening of
the bounding box is left for future work. Note that a fullscreen bounding
box is not hopelessly inefficient because completely filling a tile with
a mask is just a single CmdSolidMask per tile.
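The per-pixel arithmetic can be sketched in Rust as follows. This is an assumed model, not the kernel4 code: it treats both operations as a linear blend of the mask towards the given value, weighted by path coverage, and the function names are hypothetical.

```rust
/// FillMask: blend `value` into the mask where the path covers the pixel.
fn fill_mask(mask: f32, value: f32, coverage: f32) -> f32 {
    mask + coverage * (value - mask)
}

/// FillMaskInv: blend `value` into the mask where the path does NOT
/// cover the pixel.
fn fill_mask_inv(mask: f32, value: f32, coverage: f32) -> f32 {
    fill_mask(mask, value, 1.0 - coverage)
}

/// Mask for one pixel under a stack of nested clip paths, given each
/// path's coverage of that pixel.
fn clip_mask(coverages: &[f32]) -> f32 {
    // FillMask(1.0) over the whole screen clears the mask to "visible".
    let mut mask = fill_mask(0.0, 1.0, 1.0);
    // Each nested path knocks out everything outside itself.
    for &c in coverages {
        mask = fill_mask_inv(mask, 0.0, c);
    }
    mask
}
```

Under this model a pixel stays visible only if every nested path covers it, which is the intersection behaviour expected from nested clips.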
Fixes #30
Signed-off-by: Elias Naur <mail@eliasnaur.com>
The transformation determinant is signed, but we're only interested in
the absolute scale for transforming linewidths.
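A minimal Rust sketch of the corrected computation, assuming a column-major 2x2 matrix and a hypothetical `linewidth_scale` helper:

```rust
/// Uniform scale factor to apply to a linewidth under the 2x2 matrix
/// `mat` (column-major: [a, b, c, d]).
fn linewidth_scale(mat: [f32; 4]) -> f32 {
    let det = mat[0] * mat[3] - mat[1] * mat[2];
    // The determinant is negative for mirroring transforms; without
    // abs() the square root would be NaN.
    det.abs().sqrt()
}
```

A transform that flips one axis has a negative determinant but should scale line widths exactly like its unflipped counterpart.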
Signed-off-by: Elias Naur <mail@eliasnaur.com>
Path segments are unsorted, but other elements are using the same
sort-middle approach as before.
This is a checkpoint. At this point, there are unoptimized versions
of tile init and coarse path raster, but it isn't wired up into a
working pipeline. Also observing about a 3x performance regression in
element processing, which needs to be investigated.
Coarse rasterization wasn't entirely taking line width into account.
Also fix swizzle in matrix (not yet used). And fix missing End command
in ptcl output (hasn't been a problem because buffer was cleared).
Trying to fit it into the fancy monad doesn't really work, so use a
more straightforward approach to compute it from the aggregate.
Also add yEdge logic (basically copying piet-metal). With a fix to
ELEMENT_BINNING_RATIO (which I had simply gotten wrong), the example
renders almost correctly, with small bounding box artifacts.
Write the right_edge to the binning output.
More work on encoding the fill/stroke distinction and plumbing that
through the pipeline. This is a bit unsatisfying because of the code
duplication; having an extra fill/stroke bool might be better, but I
want to avoid making the structs bigger (this could be solved by
better packing in the struct encoding).
Fills are plumbed through to the last stage. Backdrop is WIP.
This should get the "right_edge" value for each segment plumbed through
to the binning phase. It also needs to be plumbed to coarse raster and
wired up there.
Also considering WIP because none of this logic has been tested yet.
As of this point, it mostly renders stroke outlines for tiger. Some
dropouts are because the scan in the elements pass doesn't do lookback
yet, others are probably a bug.