WIP. Most of the GPU-side work should be done (though it's not tested
end-to-end and it's certainly possible I missed something), but still
needs work on encoding side.
The compute shaders have a check for the succesful completion of their
preceding stage. However, consider a shader execution path like the
following:
void main()
if (mem_error != NO_ERROR) {
return;
}
...
malloc(...);
...
barrier();
...
}
and shader execution that fails to allocate memory, thereby setting
mem_error to ERR_MALLOC_FAILED in malloc before reaching the barrier. If
another shader execution then begins execution, its mem_eror check will
make it return early and not reach the barrier.
All GPU APIs require (dynamically) uniform control flow for barriers,
and the above case may lead to GPU hangs in practice.
Fix this issue by replacing the early exits with careful checks that
don't interrupt barrier control flow.
Unfortunately, it's harder to prove the soundness of the new checks, so
this change also clears dynamic memory ranges in MEM_DEBUG mode when
memory is exhausted. The result is that accessing memory after
exhaustion triggers an error.
Signed-off-by: Elias Naur <mail@eliasnaur.com>
Don't run extensions unless they're available. This includes querying
for descriptor indexing, and running one of two versions of kernel4
depending on whether it's enabled.
Part of the support needed for #78
coarse.comp knows the maximum stack depth, and can pre-allocate scratch
space for kernel4.comp. Kernel4 no longer contains allocations nor
control barriers.
The invocation local blend stack is gone as well; it didn't seem to make
any difference in performance to always use global memory for pushing
and popping.
Signed-off-by: Elias Naur <mail@eliasnaur.com>
See http://ssp.impulsetrain.com/gamma-premult.html for a description of
the format.
Pre-multiplied alpha only matters for translucent objects; draw a few
such shapes in the test render.
Signed-off-by: Elias Naur <mail@eliasnaur.com>
Reclaims the space waste from splitting fill mode commands from fill
commands.
For example, a CmdStroke + CmdColor use an extra tag word compared to
the former combined CmdStroke. This change shaves off that one word.
In the future, we can pack several command tags into one tag word,
saving even more space.
Fixes#66
Signed-off-by: Elias Naur <mail@eliasnaur.com>
Before this change, every command (FillColor, FillImage, BeginClip)
had (or would need) stroke, (non-zero) fill and solid variants.
This change adds a command for each fill mode and their parameters,
reducing code duplication and adds support for stroked FillImage and
BeginClip as a side-effect.
The rest of the pipeline doesn't yet support Stroked FillImage and
BeginClip. That's a follow-up change.
Since each command includes a tag, this change adds an extra word for
each fill and stroke. That waste is also addressed in a follow-up.
Updates #70
Signed-off-by: Elias Naur <mail@eliasnaur.com>
Commit 9afa9b86b6 added Rust support for
encoding flags into elements. This change adds support to shaders by
introducing variant tag structs:
struct VariantTag {
uint tag;
uint flags;
}
and returning them from Variant_tag functions.
It also adds a flags argument to write functions for enum variants that
include TagFlags.
No functionality changes.
Updates #70
Signed-off-by: Elias Naur <mail@eliasnaur.com>
FillImage is like Fill, except that it takes its color from one or
more image atlases.
kernel4 uses a single image for non-Vulkan hosts, and the dynamic sized array
of image descriptors on Vulkan.
A previous version of this commit used textures. I think images are a better
choice for piet-gpu, for several reasons:
- Texture sampling, in particular textureGrad, is slow on lower spec devices
such as Google Pixel. Texture sampling is particularly slow and difficult to
implement for CPU fallbacks.
- Texture sampling need more parameters, in particular the full u,v
transformation matrix, leading to a large increase in the command size. Since
all commands use the same size, that memory penalty is paid by all scenes, not
just scenes with textures.
- It is unlikely that piet-gpu will support every kind of fill for every
client, because each kind must be added to kernel4.
With FillImage, a client will prepare the image(s) in separate shader stages,
sampling and applying transformations and special effects as needed. Textures
that align with the output pixel grid can be used directly, without
pre-processing.
Note that the pre-processing step can run concurrently with the piet-gpu pipeline;
Only the last stage, kernel4, needs the images.
Pre-processing most likely uses fixed function vertex/fragment programs,
which on some GPUs may run in parallel with piet-gpu's compute programs.
While here, fix a few validation errors:
- Explicitly enable EXT_descriptor_indexing, KHR_maintenance3,
KHR_get_physical_device_properties2.
- Specify a vkDescriptorSetVariableDescriptorCountAllocateInfo for
vkAllocateDescriptorSets. Otherwise, variable image2D arrays won't work (but
sampler2D arrays do, at least on my setup).
Updates #38
Signed-off-by: Elias Naur <mail@eliasnaur.com>
Defining MEM_DEBUG in mem.h will add a size field to Alloc and enable
bounds and alignment checks for every memory read and write.
Notes:
- Deriving an Alloc from Path.tiles is unsound, but it's more trouble to
convert Path.tiles from TileRef to a variable sized Alloc.
- elements.comp note that "We should be able to use an array of structs but the
NV shader compiler doesn't seem to like it". If that's still relevant, does
the shared arrays of Allocs work?
Signed-off-by: Elias Naur <mail@eliasnaur.com>
Merge all static and dynamic buffers to just one, "memory". Add a malloc
function for dynamic allocations.
Unify static allocation offsets into a "config" buffer containing scene setup
(number of paths, number of path segments), as well as the memory offsets of
the static allocations.
Finally, set an overflow flag when an allocation fail, and make sure to exit
shader execution as soon as that triggers. Add checks before beginning
execution in case the client wants to run two or more shaders before checking
the flag.
The "state" buffer is left alone because it needs zero'ing and because it is
accessed with the "volatile" keyword.
Fixes#40
Signed-off-by: Elias Naur <mail@eliasnaur.com>
Eliminates the precision loss of the subtraction in the sign(end.x - start.x)
expression in kernel4. That's important for the next change that avoids
inconsistent line intersections in path_coarse.
Updates #23
Signed-off-by: Elias Naur <mail@eliasnaur.com>
Provide images to fine rasterization kernel as readonly textures with a
sampler, rather than storage images. That lets us use the GPU's hardware
for sampling, which should be considerably more efficient.
There are a bunch of parameters that are hardcoded, but it does seem to
work.
This patch passes a dynamically sized array of textures to the fine
rasterizer.
A bunch of the low level Vulkan stuff is done, but only enough of the
shaders and encoders to do minimal testing. We'll want to switch from
storage images to sampled images, track the actual array of textures
during encoding, use that to build the descriptor set (which will need
to be more dynamic), and of course run image elements through the
pipeline.
Progress towards #38
We keep a small window of the clip stack in registers in the fine
rasterization kernel, and when that window is exceeded, spill to global
memory, so the clip stack can be unbounded.
I realized there's a problem with encoding clip bboxes relative to the
current transform (see #36 for a more detailed explanation), so this is
changing it to absolute bboxes.
This more or less gets clips working. There are optimization
opportunities (all-clear and all-opaque mask tiles), and it doesn't deal
with overflow of the blend stack, but it seems to basically work.
Actually handle transforms in RenderCtx (was implemented in renderer but
not actually plumbed through). This also requires maintaining a state
stack, which will also be required for clipping.
This PR also starts work on encoding clipping, including tracking
bounding boxes.
WIP, none of this is tested yet.
Expand the the final kernel4 stage to maintain a per-pixel mask.
Introduce two new path elements, FillMask and FillMaskInv, to fill
the mask. FillMask acts like Fill, while FillMaskInv fills the area
outside the path.
SVG clipPaths is then representable by a FillMaskInv(0.0) for every nested
path, preceded by a FillMask(1.0) to clear the mask.
The bounding box for FillMaskInv elements is the entire screen; tightening of
the bounding box is left for future work. Note that a fullscreen bounding
box is not hopelessly inefficient because completely filling a tile with
a mask is just a single CmdSolidMask per tile.
Fixes#30
Signed-off-by: Elias Naur <mail@eliasnaur.com>
In kernel 4, compute a chunk of pixels rather than just one per thread.
This is a dramatic speedup.
(This commit cherry-picked from another working branch)