Commit graph

152 commits

Author SHA1 Message Date
Raph Levien 8015eb25a1 Also fix write-after-read in elements.comp
On further testing, this resolves a hard lockup on Intel 630 on the
mmark stress test, so is worth getting into the repo.
2021-11-14 08:23:37 -08:00
Raph Levien 95aad3e6c7 Put memory barrier reliably before flag write 2021-11-02 13:02:12 -07:00
Raph Levien e50d5c1f58 Add memory barrier to elements shader
The flag read needs acquire semantics. There are a number of ways that
could be expressed, but a generally portable way is to have a barrier
after. However, in the translation to Metal, that barrier needs to be in
uniform control flow. This patch does some workarounds to ensure that.
2021-11-02 12:50:11 -07:00
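The two barrier placements above (a barrier before the flag write, and a barrier after the flag read so that the read gets acquire semantics) follow the same pattern as fence-based publication on the CPU. The sketch below uses Rust's `std::sync::atomic::fence` as a stand-in for the shader's memory barriers. It is an analogy, not the shader code, and `Slot`, `publish` and `consume` are hypothetical names.

```rust
use std::sync::atomic::{fence, AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// A payload plus a flag announcing that the payload is ready.
struct Slot {
    data: AtomicU32,
    flag: AtomicU32,
}

/// Writer: store the payload, then issue a release fence *before* the
/// flag write, so anyone who observes the flag also observes the payload.
fn publish(slot: &Slot, value: u32) {
    slot.data.store(value, Ordering::Relaxed);
    fence(Ordering::Release);
    slot.flag.store(1, Ordering::Relaxed);
}

/// Reader: spin until the flag is set, then issue an acquire fence *after*
/// the flag read, which gives that read acquire semantics.
fn consume(slot: &Slot) -> u32 {
    while slot.flag.load(Ordering::Relaxed) == 0 {
        std::hint::spin_loop();
    }
    fence(Ordering::Acquire);
    slot.data.load(Ordering::Relaxed)
}

fn main() {
    let slot = Arc::new(Slot {
        data: AtomicU32::new(0),
        flag: AtomicU32::new(0),
    });
    let reader = {
        let slot = Arc::clone(&slot);
        thread::spawn(move || consume(&slot))
    };
    publish(&slot, 42);
    assert_eq!(reader.join().unwrap(), 42);
}
```

Unlike this CPU version, the shader also has to keep the barrier in uniform control flow for the Metal translation, which is what the workarounds in this commit address.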
Elias Naur 039cfcf0de piet-gpu/shader: treat memoryBarrierBuffer as a control barrier
memoryBarrierBuffer is mapped to the threadgroup_barrier function in
Metal, which is a control barrier that must be executed by all threads
(or none). This change establishes that property for the two memory
barriers we have.

While here, remove ENABLE_IMAGE_INDICES completely; it was disabled in
an earlier change.

Signed-off-by: Elias Naur <mail@eliasnaur.com>
2021-08-20 20:41:35 +02:00
Raph Levien 59728868de Merge branch 'master' into gradient 2021-08-16 10:53:19 -07:00
Raph Levien 05e81acebc Basically get gradients working
Separate out render context upload from renderer creation. Upload ramps
to GPU buffer. Encode gradients to scene description. Fix a number of
bugs in uploading and processing.

This renders gradients in a test image, but has some shortcomings. For
one, staging buffers need to be used for a couple of things (they're
just host mapped for now). Also, the interaction between sRGB and
premultiplied alpha isn't quite right. The size of the gradient ramp
buffer is fixed and should be dynamic.

And of course there's always more optimization to be done, including
making the upload of gradient ramps more incremental, and probably
hashing of the stops instead of the processed ramps.
2021-08-09 16:16:46 -07:00
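Uploading ramps means turning a gradient's color stops into a fixed-size table that the shader can index. A minimal sketch, assuming linear interpolation directly on the stop colors, RGBA as `[f32; 4]`, and a caller-chosen ramp length `n`. The `Stop` type and `build_ramp` are hypothetical, and the sketch does not attempt the sRGB/premultiplied-alpha handling that the message says is not yet right.

```rust
/// A gradient color stop: an offset in [0, 1] and an RGBA color.
#[derive(Clone, Copy)]
struct Stop {
    offset: f32,
    color: [f32; 4],
}

/// Sample `stops` (sorted by offset) into a ramp of `n` entries by linear
/// interpolation; positions before the first or after the last stop clamp.
fn build_ramp(stops: &[Stop], n: usize) -> Vec<[f32; 4]> {
    assert!(!stops.is_empty() && n >= 2);
    (0..n)
        .map(|i| {
            let t = i as f32 / (n - 1) as f32;
            if t <= stops[0].offset {
                return stops[0].color;
            }
            for w in stops.windows(2) {
                let (a, b) = (w[0], w[1]);
                if t <= b.offset {
                    // Here a.offset < t <= b.offset, so the span is positive.
                    let u = (t - a.offset) / (b.offset - a.offset);
                    let c: [f32; 4] =
                        std::array::from_fn(|k| a.color[k] + (b.color[k] - a.color[k]) * u);
                    return c;
                }
            }
            stops[stops.len() - 1].color
        })
        .collect()
}

fn main() {
    let stops = [
        Stop { offset: 0.0, color: [1.0, 0.0, 0.0, 1.0] },
        Stop { offset: 1.0, color: [0.0, 0.0, 1.0, 1.0] },
    ];
    let ramp = build_ramp(&stops, 5);
    println!("{:?}", ramp);
}
```

Hashing the stops rather than the processed ramp, as suggested above, would let identical gradients reuse a ramp without first building it.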
Raph Levien 3af033f71f Merge pull request #108 from linebender/path_hang2
Retain subdivision results
2021-07-19 10:22:55 -07:00
Raph Levien 62df7c0bd5 Remove leftover debug stuff
In response to review by Elias.
2021-07-19 08:39:44 -07:00
Raph Levien 29a8975a9a Retain subdivision results
Don't recompute the parameters from quadratic subdivision, but rather
retain them across the two phases (summing the subdivision estimate, and
generating the subdivisions). The motivation for this is that the values
were subtly different (differing by 1 or 2 least significant bits) across
the two phases. It *might* also be faster depending on ALU/memory
relative performance.

Fixes #107
2021-07-15 11:18:48 -07:00
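The fix retains each segment's subdivision result from the first phase and reuses it in the second, so the total used for sizing and the amount actually generated cannot drift apart. A sketch of that two-phase shape, where `estimate` is a hypothetical stand-in for the real quadratic-subdivision math and the counts are a simplification of the shader's parameters:

```rust
/// Hypothetical per-segment subdivision estimate (not the shader's formula).
fn estimate(seg_len: f32, tolerance: f32) -> f32 {
    (seg_len / tolerance).sqrt()
}

/// Phase 1: compute each segment's subdivision count once and retain it.
/// Returns the retained counts and their total (used to size the output).
fn phase1(seg_lens: &[f32], tolerance: f32) -> (Vec<u32>, u32) {
    let counts: Vec<u32> = seg_lens
        .iter()
        .map(|&len| (estimate(len, tolerance).ceil() as u32).max(1))
        .collect();
    let total: u32 = counts.iter().sum();
    (counts, total)
}

/// Phase 2: generate subdivision points from the *retained* counts rather
/// than recomputing them, so the output length always equals the phase-1 total.
fn phase2(counts: &[u32]) -> Vec<(usize, f32)> {
    let mut out = Vec::new();
    for (seg, &n) in counts.iter().enumerate() {
        for i in 1..=n {
            out.push((seg, i as f32 / n as f32));
        }
    }
    out
}

fn main() {
    let (counts, total) = phase1(&[4.0, 9.0, 0.5], 1.0);
    let points = phase2(&counts);
    assert_eq!(points.len() as u32, total);
    println!("{total} points for counts {counts:?}");
}
```

On a CPU, recomputing would give identical bits. The GPU problem was that the two phases could evaluate the same expression slightly differently, which retaining the values sidesteps.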
Raph Levien 6f707c4c62 Start work on gradients
WIP. Most of the GPU-side work should be done (though it's not tested
end-to-end and it's certainly possible I missed something), but the
encoding side still needs work.
2021-07-12 06:56:52 -07:00
Ishi Tatsuyuki 7a2dc37d36 Remove manual blend stack spilling and rely on scratch memory instead
v2: Add a panic when the nested blend depth exceeds the limit.
v3: Rebase and partially remove code introduced in 22507de.
2021-06-25 17:13:01 +09:00
Ishi Tatsuyuki d77dfb8c00 Runtime querying of threadgroup size 2021-06-08 16:29:40 +09:00
Ishi Tatsuyuki c2772ceac7 Boost backdrop parallelism for the prefix sums 2021-06-08 15:09:32 +09:00
Elias Naur 4b59525e1f use mediump precision for kernel4 colors and areas
Improves kernel4 performance for a Gio scene from ~22ms to ~15ms.

Updates #83

Signed-off-by: Elias Naur <mail@eliasnaur.com>
2021-04-20 10:15:42 +02:00
Elias Naur d9d518b248 avoid non-uniform barrier control flow when exhausting memory
The compute shaders have a check for the successful completion of their
preceding stage. However, consider a shader execution path like the
following:

	void main() {
		if (mem_error != NO_ERROR) {
			return;
		}
		// ...
		malloc(...);
		// ...
		barrier();
		// ...
	}

and a shader execution that fails to allocate memory, thereby setting
mem_error to ERR_MALLOC_FAILED in malloc before reaching the barrier. If
another shader execution then begins execution, its mem_error check will
make it return early and not reach the barrier.

All GPU APIs require (dynamically) uniform control flow for barriers,
and the above case may lead to GPU hangs in practice.

Fix this issue by replacing the early exits with careful checks that
don't interrupt barrier control flow.

Unfortunately, it's harder to prove the soundness of the new checks, so
this change also clears dynamic memory ranges in MEM_DEBUG mode when
memory is exhausted. The result is that accessing memory after
exhaustion triggers an error.

Signed-off-by: Elias Naur <mail@eliasnaur.com>
2021-04-20 10:15:29 +02:00
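The replacement pattern (skip the work, never the barrier) can be illustrated on the CPU with `std::sync::Barrier`. Like a GPU control barrier, it only releases once every participant has arrived, so a thread that returned early would leave the others waiting. In this sketch the atomic `mem_error` flag and the "work" counter are hypothetical stand-ins for the shader state.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::sync::{Arc, Barrier};
use std::thread;

const THREADS: usize = 4;

/// Run THREADS workers through two barrier-separated phases. `fail` picks a
/// worker whose (simulated) allocation fails. Returns how many workers did
/// the phase-two work.
fn run(fail: Option<usize>) -> u32 {
    let barrier = Arc::new(Barrier::new(THREADS));
    let mem_error = Arc::new(AtomicBool::new(false)); // stand-in for mem_error
    let work_done = Arc::new(AtomicU32::new(0));

    let handles: Vec<_> = (0..THREADS)
        .map(|id| {
            let barrier = Arc::clone(&barrier);
            let mem_error = Arc::clone(&mem_error);
            let work_done = Arc::clone(&work_done);
            let malloc_fails = fail == Some(id);
            thread::spawn(move || {
                // Phase one: a failed "malloc" records the error instead of
                // returning early, so this worker still reaches the barrier.
                if malloc_fails {
                    mem_error.store(true, Ordering::Relaxed);
                }
                barrier.wait(); // reached by every worker, error or not
                // Phase two: skip the work, not the barrier.
                if !mem_error.load(Ordering::Relaxed) {
                    work_done.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    work_done.load(Ordering::Relaxed)
}

fn main() {
    assert_eq!(run(None), THREADS as u32);
    assert_eq!(run(Some(0)), 0);
}
```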
Elias Naur 3b4a72deb9 elements.comp: remove redundant assignment
The assignment was made redundant by eb86456f31.

Signed-off-by: Elias Naur <mail@eliasnaur.com>
2021-04-20 10:14:04 +02:00
Raph Levien 1c842f8471 Merge branch 'master' into ext_query 2021-04-11 15:33:49 -07:00
Elias Naur 45ea43c157 kernel4: replace continue in switch to support D3D11 shader model 5.0
Without this change, the fxc.exe compiler complains

error X3708: continue cannot be used in a switch

Signed-off-by: Elias Naur <mail@eliasnaur.com>
2021-04-11 21:49:57 +02:00
Raph Levien 01e4024599 Merge branch 'master' into ext_query 2021-04-11 09:08:46 -07:00
Raph Levien 115cb855d9 Query extensions at runtime
Don't run extensions unless they're available. This includes querying
for descriptor indexing, and running one of two versions of kernel4
depending on whether it's enabled.

Part of the support needed for #78
2021-04-08 15:11:15 -07:00
Elias Naur eb86456f31 elements.comp: don't modify BeginClip bounding box
The BeginClip and EndClip bounding boxes are absolute and must pairwise
match. I mistakenly modified the BeginClip bounding box for stroked
clips.

Signed-off-by: Elias Naur <mail@eliasnaur.com>
2021-04-08 19:56:37 +02:00
Elias Naur 5db427c549 kernel4: compute and output alpha
Signed-off-by: Elias Naur <mail@eliasnaur.com>
2021-03-31 19:51:49 +02:00
Elias Naur ee4429a26f kernel4: separate area from alpha in clip stack
This change prepares for kernel4 to output alpha. No functional changes.

Signed-off-by: Elias Naur <mail@eliasnaur.com>
2021-03-31 19:51:42 +02:00
Elias Naur 22507dea0e pre-allocate kernel4 scratch space in coarse.comp
coarse.comp knows the maximum stack depth, and can pre-allocate scratch
space for kernel4.comp. Kernel4 no longer contains allocations or
control barriers.

The invocation local blend stack is gone as well; it didn't seem to make
any difference in performance to always use global memory for pushing
and popping.

Signed-off-by: Elias Naur <mail@eliasnaur.com>
2021-03-31 18:48:19 +02:00
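Since coarse.comp sees every BeginClip/EndClip in a tile's command list, it can compute the maximum nesting depth and reserve kernel4's scratch space up front. A sketch of that computation, assuming a simplified hypothetical `Cmd` enum and one word of scratch per pixel per stack level (the real per-level size may differ):

```rust
/// Simplified, hypothetical per-tile command kinds.
#[derive(Clone, Copy)]
enum Cmd {
    Fill,
    BeginClip,
    EndClip,
}

/// Maximum BeginClip/EndClip nesting depth in a tile's command list.
/// Assumes every EndClip matches an earlier BeginClip.
fn max_clip_depth(cmds: &[Cmd]) -> usize {
    let (mut depth, mut max) = (0usize, 0usize);
    for cmd in cmds {
        match cmd {
            Cmd::BeginClip => {
                depth += 1;
                max = max.max(depth);
            }
            Cmd::EndClip => depth -= 1,
            Cmd::Fill => {}
        }
    }
    max
}

/// Scratch words to reserve for one tile, assuming one word per pixel
/// per stack level.
fn scratch_words(cmds: &[Cmd], pixels_per_tile: usize) -> usize {
    max_clip_depth(cmds) * pixels_per_tile
}

fn main() {
    use Cmd::*;
    let cmds = [Fill, BeginClip, BeginClip, Fill, EndClip, BeginClip, EndClip, EndClip];
    // Two nested clips in a 16x16-pixel tile.
    println!("{}", scratch_words(&cmds, 16 * 16));
}
```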
Elias Naur e6b535d942 coarse.comp: extract area commands into function
No functional changes.

Signed-off-by: Elias Naur <mail@eliasnaur.com>
2021-03-30 19:56:09 +02:00
Elias Naur d916a9e2c4 backdrop.comp: support stroked Annotated_Image and Annotated_BeginClip
Commit 8db77e180e added support for strokes to FillImage and BeginClip,
but missed backdrop.comp.

Signed-off-by: Elias Naur <mail@eliasnaur.com>
2021-03-30 19:33:25 +02:00
Elias Naur 678bfedfca kernel4: assume colors in alpha-premultiplied sRGB format
See http://ssp.impulsetrain.com/gamma-premult.html for a description of
the format.

Pre-multiplied alpha only matters for translucent objects; draw a few
such shapes in the test render.

Signed-off-by: Elias Naur <mail@eliasnaur.com>
2021-03-29 21:17:01 +02:00
Elias Naur eb37db1b05 replace per-element fill mode flags with a SetFillMode element
Fixes #70

Signed-off-by: Elias Naur <mail@eliasnaur.com>
2021-03-29 21:10:25 +02:00
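With a SetFillMode element, the fill mode becomes state carried forward through the element stream instead of a flag on every element. The sequential sketch below only shows those semantics. On the GPU the state would have to be propagated in parallel. `Element`, `FillMode` and the NonZero default are hypothetical, not piet-gpu's actual encoding.

```rust
/// Hypothetical fill modes.
#[derive(Clone, Copy, Debug, PartialEq)]
enum FillMode {
    NonZero,
    Stroke,
}

/// Hypothetical scene elements: a mode change, or a path identified by id.
enum Element {
    SetFillMode(FillMode),
    Path(u32),
}

/// Give each path the most recent SetFillMode before it (NonZero if none).
fn resolve_fill_modes(elements: &[Element]) -> Vec<(u32, FillMode)> {
    let mut mode = FillMode::NonZero;
    let mut out = Vec::new();
    for el in elements {
        match el {
            Element::SetFillMode(m) => mode = *m,
            Element::Path(id) => out.push((*id, mode)),
        }
    }
    out
}

fn main() {
    use Element::*;
    let scene = [Path(1), SetFillMode(FillMode::Stroke), Path(2), Path(3)];
    println!("{:?}", resolve_fill_modes(&scene));
}
```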
Elias Naur bb61f875dc kernel4: remove dead code left over from previous clipping approach
Signed-off-by: Elias Naur <mail@eliasnaur.com>
2021-03-29 21:10:17 +02:00
Tatsuyuki Ishi 4864a7fe0f Create chunks over the x axis in addition to y axis
This allows more coalescing with image loads/stores, since all of our images are stored with a tiled layout.
2021-03-23 20:54:49 +09:00
Elias Naur f0127812eb tightly pack fine rasterizer commands
Reclaims the space wasted by splitting fill mode commands from fill
commands.

For example, a CmdStroke + CmdColor pair uses an extra tag word compared to
the former combined CmdStroke. This change shaves off that one word.

In the future, we can pack several command tags into one tag word,
saving even more space.

Fixes #66

Signed-off-by: Elias Naur <mail@eliasnaur.com>
2021-03-19 16:43:33 +01:00
Elias Naur 8db77e180e support stroked fills for clips, images
This change completes general support for stroked fills for clips and
images.

Annotated_size increases from 28 to 32, because of the linewidth field
added to AnnoImage. Stroked image fills are presumably rare, and if
memory pressure turns out to be a bottleneck, we could replace the
linewidth field with a separate AnnoLinewidth element.

Updates #70

Signed-off-by: Elias Naur <mail@eliasnaur.com>
2021-03-19 16:43:33 +01:00
Elias Naur db59b5d570 coarse,kernel4: make stroke, (non-zero) fill, solid separate commands
Before this change, every command (FillColor, FillImage, BeginClip)
had (or would need) stroke, (non-zero) fill and solid variants.

This change adds a command for each fill mode and their parameters,
reducing code duplication and adding support for stroked FillImage and
BeginClip as a side-effect.

The rest of the pipeline doesn't yet support stroked FillImage and
BeginClip. That's a follow-up change.

Since each command includes a tag, this change adds an extra word for
each fill and stroke. That waste is also addressed in a follow-up.

Updates #70

Signed-off-by: Elias Naur <mail@eliasnaur.com>
2021-03-19 16:43:33 +01:00
Elias Naur 44bff2726c collapse FillCubic and StrokeCubic into Cubic with flags for fill mode
Updates #70

Signed-off-by: Elias Naur <mail@eliasnaur.com>
2021-03-19 12:50:12 +01:00
Elias Naur df055563bd collapse annotated Fill and Stroke to Color with fill mode flag
No functionality changes, just different encoding.

Updates #70

Signed-off-by: Elias Naur <mail@eliasnaur.com>
2021-03-19 12:50:12 +01:00
Elias Naur e9ff509ab9 use tag flags for fill vs stroke modes in scene elements
Encode stroke vs fill as tag flags, thereby reducing the number of scene
elements. Encoding change only, no functional changes.

The previous Stroke and Fill commands are merged to one command,
FillColor. The encoding to annotated element is divergent, which is
fixed when annotated elements move to tag flags.

Updates #70

Signed-off-by: Elias Naur <mail@eliasnaur.com>
2021-03-19 12:50:12 +01:00
Elias Naur a5b6bda941 add support for element flags to shaders
Commit 9afa9b86b6 added Rust support for
encoding flags into elements. This change adds support to shaders by
introducing variant tag structs:

struct VariantTag {
    uint tag;
    uint flags;
};

and returning them from Variant_tag functions.

It also adds a flags argument to write functions for enum variants that
include TagFlags.

No functionality changes.

Updates #70

Signed-off-by: Elias Naur <mail@eliasnaur.com>
2021-03-19 12:50:12 +01:00
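A tag word that carries flags can be split into the `VariantTag` pair described above. This sketch assumes the tag occupies the low 16 bits and the flags the high 16 bits of one `u32`. That layout is an illustrative assumption, not necessarily the one the generated shader code uses, and `pack`/`unpack` are hypothetical helpers.

```rust
/// Mirror of the shader-side struct: a variant tag and its flags.
#[derive(Clone, Copy, Debug, PartialEq)]
struct VariantTag {
    tag: u32,
    flags: u32,
}

/// Pack a tag and its flags into one word (assumed layout: flags << 16 | tag).
fn pack(tag: u32, flags: u32) -> u32 {
    assert!(tag <= 0xffff && flags <= 0xffff);
    tag | (flags << 16)
}

/// Split a tag word back into tag and flags.
fn unpack(word: u32) -> VariantTag {
    VariantTag { tag: word & 0xffff, flags: word >> 16 }
}

fn main() {
    let word = pack(5, 1);
    assert_eq!(word, 0x0001_0005);
    assert_eq!(unpack(word), VariantTag { tag: 5, flags: 1 });
}
```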
Elias Naur 903ab1fb59 implement FillImage command and sRGB support
FillImage is like Fill, except that it takes its color from one or
more image atlases.

kernel4 uses a single image for non-Vulkan hosts, and the dynamically sized array
of image descriptors on Vulkan.

A previous version of this commit used textures. I think images are a better
choice for piet-gpu, for several reasons:

- Texture sampling, in particular textureGrad, is slow on lower-spec devices
  such as Google Pixel. Texture sampling is particularly slow and difficult to
  implement for CPU fallbacks.
- Texture sampling needs more parameters, in particular the full u,v
  transformation matrix, leading to a large increase in the command size. Since
  all commands use the same size, that memory penalty is paid by all scenes, not
  just scenes with textures.
- It is unlikely that piet-gpu will support every kind of fill for every
  client, because each kind must be added to kernel4.

With FillImage, a client will prepare the image(s) in separate shader stages,
sampling and applying transformations and special effects as needed. Textures
that align with the output pixel grid can be used directly, without
pre-processing.

Note that the pre-processing step can run concurrently with the piet-gpu pipeline;
only the last stage, kernel4, needs the images.

Pre-processing most likely uses fixed function vertex/fragment programs,
which on some GPUs may run in parallel with piet-gpu's compute programs.

While here, fix a few validation errors:
- Explicitly enable EXT_descriptor_indexing, KHR_maintenance3,
  KHR_get_physical_device_properties2.
- Specify a VkDescriptorSetVariableDescriptorCountAllocateInfo for
  vkAllocateDescriptorSets. Otherwise, variable image2D arrays won't work (but
  sampler2D arrays do, at least on my setup).

Updates #38

Signed-off-by: Elias Naur <mail@eliasnaur.com>
2021-03-19 12:50:12 +01:00
Elias Naur 07e07c7544 ensure consistent path segment transformation
As described in #62, the non-deterministic scene monoid may result in
slightly different transformations for path segments in an otherwise
closed path.

This change ensures consistent transformation across paths in three steps.

First, absolute transformations computed by the scene monoid are stored
along with path segments and annotated elements.

Second, elements.comp no longer transforms path segments. Instead, each
segment is stored untransformed along with a reference to its absolute
transformation.

Finally, path_coarse performs the transformation of path segments.
Because all segments in a path share a single transformation reference,
the inconsistency in #62 is avoided.

Fixes #62

Signed-off-by: Elias Naur <mail@eliasnaur.com>
2021-03-19 12:45:23 +01:00
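Storing segments untransformed, with a reference to one shared transform per path, means a point shared by two consecutive segments goes through the same arithmetic with the same inputs, so both copies come out bit-identical. A sketch with hypothetical `Affine` and `Segment` types and a `trans_ix` index, applying the transform late, as path_coarse does after this change:

```rust
type Point = (f32, f32);

/// 2D affine transform mapping (x, y) to (a*x + c*y + e, b*x + d*y + f).
#[derive(Clone, Copy)]
struct Affine {
    a: f32,
    b: f32,
    c: f32,
    d: f32,
    e: f32,
    f: f32,
}

impl Affine {
    fn apply(&self, p: Point) -> Point {
        (self.a * p.0 + self.c * p.1 + self.e, self.b * p.0 + self.d * p.1 + self.f)
    }
}

/// A segment stored untransformed, with an index into the transform table.
struct Segment {
    p0: Point,
    p1: Point,
    trans_ix: usize,
}

/// Transform each segment using its path's shared transform.
fn transform_segments(segs: &[Segment], transforms: &[Affine]) -> Vec<(Point, Point)> {
    segs.iter()
        .map(|s| {
            let t = transforms[s.trans_ix];
            (t.apply(s.p0), t.apply(s.p1))
        })
        .collect()
}

/// Build a closed polygon's segments, all sharing transform `trans_ix`.
fn closed_path(pts: &[Point], trans_ix: usize) -> Vec<Segment> {
    (0..pts.len())
        .map(|i| Segment { p0: pts[i], p1: pts[(i + 1) % pts.len()], trans_ix })
        .collect()
}

fn main() {
    let transforms = [Affine { a: 0.7, b: 0.3, c: -0.3, d: 0.7, e: 10.1, f: 20.2 }];
    let segs = closed_path(&[(1.1, 2.2), (5.3, 0.4), (3.3, 7.7)], 0);
    let out = transform_segments(&segs, &transforms);
    // Each segment's end matches the next segment's start bit-for-bit.
    for i in 0..out.len() {
        assert_eq!(out[i].1, out[(i + 1) % out.len()].0);
    }
}
```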
Elias Naur ad444f615c elements.comp: use shared array of structs directly
The NVIDIA shader compiler bug that forced splitting of the state struct
into primitive types is now fixed.

Signed-off-by: Elias Naur <mail@eliasnaur.com>
2021-03-19 12:45:23 +01:00
Elias Naur 79d722df48 remove unused commands from pathseg
Signed-off-by: Elias Naur <mail@eliasnaur.com>
2021-03-19 12:45:23 +01:00
Elias Naur b73eabf4eb kernel4.comp: remove unused commands
Signed-off-by: Elias Naur <mail@eliasnaur.com>
2021-02-24 15:32:24 +01:00
Elias Naur 6a4e26ef2a all: add optional memory checks
Defining MEM_DEBUG in mem.h will add a size field to Alloc and enable
bounds and alignment checks for every memory read and write.

Notes:
- Deriving an Alloc from Path.tiles is unsound, but it's more trouble to
  convert Path.tiles from TileRef to a variable sized Alloc.
- elements.comp notes that "We should be able to use an array of structs but the
  NV shader compiler doesn't seem to like it". If that's still relevant, do
  shared arrays of Allocs work?

Signed-off-by: Elias Naur <mail@eliasnaur.com>
2021-02-15 16:07:45 +01:00
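With MEM_DEBUG, each Alloc carries a size so that every read and write can be bounds- and alignment-checked. A CPU sketch of that idea, with hypothetical `Alloc`, `Memory` and `MemError` types, byte offsets, and 4-byte words. It assumes each Alloc's own offset is word-aligned and inside the buffer. The shader's actual struct layout and error reporting may differ.

```rust
/// A hypothetical debug allocation: byte offset and size into `Memory`.
#[derive(Clone, Copy, Debug)]
struct Alloc {
    offset: u32,
    size: u32,
}

#[derive(Debug, PartialEq)]
enum MemError {
    OutOfBounds,
    Misaligned,
}

/// Word-addressed memory with checked accesses, in the spirit of MEM_DEBUG.
struct Memory {
    words: Vec<u32>,
}

impl Memory {
    /// Validate a 4-byte access at byte `offset` within `alloc`; return the
    /// word index into `words`.
    fn check(alloc: Alloc, offset: u32) -> Result<usize, MemError> {
        if offset % 4 != 0 {
            return Err(MemError::Misaligned);
        }
        if u64::from(offset) + 4 > u64::from(alloc.size) {
            return Err(MemError::OutOfBounds);
        }
        Ok(((alloc.offset + offset) / 4) as usize)
    }

    fn read(&self, alloc: Alloc, offset: u32) -> Result<u32, MemError> {
        Ok(self.words[Self::check(alloc, offset)?])
    }

    fn write(&mut self, alloc: Alloc, offset: u32, value: u32) -> Result<(), MemError> {
        let ix = Self::check(alloc, offset)?;
        self.words[ix] = value;
        Ok(())
    }
}

fn main() {
    let mut mem = Memory { words: vec![0; 16] };
    let alloc = Alloc { offset: 8, size: 8 }; // two words
    mem.write(alloc, 4, 99).unwrap();
    println!("{:?}", mem.read(alloc, 4)); // Ok(99)
    println!("{:?}", mem.read(alloc, 8)); // Err(OutOfBounds)
    println!("{:?}", mem.read(alloc, 2)); // Err(Misaligned)
}
```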
Elias Naur ee67a0a515 kernel4: simplify a tiny bit
Signed-off-by: Elias Naur <mail@eliasnaur.com>
2020-12-27 20:24:29 +01:00
Elias Naur 716517cc04 coarse,binning: organize bins into width_in_bins x height_in_bins
The binning shader supports up to N_TILE bins. To efficiently cover wide or
tall viewports, convert the rigid N_TILE_X x N_TILE_Y bin layout to a variable
width_in_bins x height_in_bins layout.

Signed-off-by: Elias Naur <mail@eliasnaur.com>
2020-12-27 20:24:29 +01:00
Elias Naur ef4ec772ad backdrop: repair unsound optimization
Signed-off-by: Elias Naur <mail@eliasnaur.com>
2020-12-27 20:24:29 +01:00
Elias Naur 8b62022749 backdrop: avoid a (benign) zero-sized read
Found with MEM_DEBUG added in later change.

Signed-off-by: Elias Naur <mail@eliasnaur.com>
2020-12-27 20:24:29 +01:00
Elias Naur c4f5a69a0d implement variable output sizing
Signed-off-by: Elias Naur <mail@eliasnaur.com>
2020-12-27 20:24:29 +01:00
Elias Naur c67696714b coarse.comp: don't write Cmd_End to tiles out of bounds
If WIDTH_IN_TILES or HEIGHT_IN_TILES are not divisible by N_TILE_X or N_TILE_Y
respectively, the previously unconditional Cmd_End_write would write out of
bounds.

Signed-off-by: Elias Naur <mail@eliasnaur.com>
2020-12-27 20:24:29 +01:00
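The out-of-bounds case arises when the last column or row of bins is only partly covered by tiles. A sketch of the guard, assuming hypothetical 16x16 bins (`N_TILE_X`, `N_TILE_Y`) and counting per-tile writes instead of writing real commands:

```rust
const N_TILE_X: u32 = 16;
const N_TILE_Y: u32 = 16;

/// Simulate coarse workgroups, each covering an N_TILE_X x N_TILE_Y bin of
/// tiles and writing Cmd_End to each of its tiles; returns per-tile write counts.
fn write_cmd_ends(width_in_tiles: u32, height_in_tiles: u32) -> Vec<u32> {
    let bins_x = (width_in_tiles + N_TILE_X - 1) / N_TILE_X;
    let bins_y = (height_in_tiles + N_TILE_Y - 1) / N_TILE_Y;
    let mut writes = vec![0u32; (width_in_tiles * height_in_tiles) as usize];
    for by in 0..bins_y {
        for bx in 0..bins_x {
            for ty in 0..N_TILE_Y {
                for tx in 0..N_TILE_X {
                    let (x, y) = (bx * N_TILE_X + tx, by * N_TILE_Y + ty);
                    // Skip tiles past the right or bottom edge. Without this check
                    // the index below can hit the wrong tile or run past the end.
                    if x < width_in_tiles && y < height_in_tiles {
                        writes[(y * width_in_tiles + x) as usize] += 1;
                    }
                }
            }
        }
    }
    writes
}

fn main() {
    // 20x5 tiles is not a multiple of the 16x16 bin size.
    let writes = write_cmd_ends(20, 5);
    assert!(writes.iter().all(|&c| c == 1));
    println!("{} tiles, each written once", writes.len());
}
```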
Elias Naur 4de67d9081 unify GPU memory management
Merge all static and dynamic buffers to just one, "memory". Add a malloc
function for dynamic allocations.

Unify static allocation offsets into a "config" buffer containing scene setup
(number of paths, number of path segments), as well as the memory offsets of
the static allocations.

Finally, set an overflow flag when an allocation fails, and make sure to exit
shader execution as soon as that triggers. Add checks before beginning
execution in case the client wants to run two or more shaders before checking
the flag.

The "state" buffer is left alone because it needs zeroing and because it is
accessed with the "volatile" keyword.

Fixes #40

Signed-off-by: Elias Naur <mail@eliasnaur.com>
2020-12-27 20:24:29 +01:00
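A single buffer with a malloc function and an overflow flag amounts to a bump allocator. A CPU sketch using an atomic counter, assuming the hypothetical names `Heap`, `malloc` and `may_run`, a statically allocated prefix, and that a failed allocation just sets the flag and returns `None`. How the real shaders report failure may differ.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};

/// A hypothetical single-buffer heap: a bump pointer plus an overflow flag.
struct Heap {
    size: u32,
    next: AtomicU32,
    failed: AtomicBool,
}

impl Heap {
    /// `static_bytes` models the statically allocated region at the start.
    fn new(size: u32, static_bytes: u32) -> Self {
        Heap { size, next: AtomicU32::new(static_bytes), failed: AtomicBool::new(false) }
    }

    /// Bump-allocate `bytes`; on overflow, set the flag and return None.
    /// The counter keeps growing after a failure, so every later request
    /// fails too (u32 wraparound of fetch_add is ignored in this sketch).
    fn malloc(&self, bytes: u32) -> Option<u32> {
        let offset = self.next.fetch_add(bytes, Ordering::Relaxed);
        if u64::from(offset) + u64::from(bytes) > u64::from(self.size) {
            self.failed.store(true, Ordering::Relaxed);
            return None;
        }
        Some(offset)
    }

    /// Check run before a stage starts, so a client can chain several
    /// stages and inspect the flag only at the end.
    fn may_run(&self) -> bool {
        !self.failed.load(Ordering::Relaxed)
    }
}

fn main() {
    let heap = Heap::new(64, 16);
    assert_eq!(heap.malloc(32), Some(16));
    assert!(heap.may_run());
    assert_eq!(heap.malloc(32), None); // 48 + 32 > 64
    assert!(!heap.may_run());
}
```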