vello

alex/vello

mirror of https://github.com/italicsjenga/vello.git synced 2025-01-11 04:51:32 +11:00

Author	SHA1	Message	Date
Raph Levien	240f44a228	Implement robust dynamic memory This is the core logic for robust dynamic memory. There are changes to both shaders and the driver logic. On the shader side, failure information is more useful and fine grained. In particular, it now reports which stage failed and how much memory would have been required to make that stage succeed. On the driver side, there is a new RenderDriver abstraction which owns command buffers (and associated query pools) and runs the logic to retry and reallocate buffers when necessary. There's also a fairly significant rework of the logic to produce the config block, as that overlaps the robust memory. The RenderDriver abstraction may not stay. It was done this way to minimize code disruption, but arguably it should just be combined with Renderer. Another change: the GLSL length() method on a buffer requires additional infrastructure (at least on Metal, where it needs a binding of its own), so we now pass that in as a field in the config. This also moves blend memory to its own buffer. This worked out well because coarse rasterization can simply report the size of the blend buffer and it can be reallocated without needing to rerun the pipeline. In the previous state, blend allocations and ptcl writes were interleaved in coarse rasterization, so a failure of the former would require rerunning coarse. This should fix #83 (finally!) There are a few loose ends. The binaries haven't (yet) been updated (I've been testing using a hand-written test program). Gradients weren't touched so still have a fixed size allocation. And the logic to calculate the new buffer size on allocation failure could be smarter. Closes #175	2022-07-13 12:34:51 -07:00
Raph Levien	e73049fe98	First cut at split blend stack Split the blend stack into register and memory segments. Do blending in registers up to that size, then spill to memory if needed. This version may regress performance on Pixel 4, as it uses common memory for the blend stack, rather than keeping that memory read-only in fine rasterization, and using a separate buffer for blend stack. This needs investigation. It's possible we'll want to have single common memory as a config option, as it pools allocations and decreases the probability of failure. Also a flaw in this version: there is no checking of memory overflow. For understanding code history: this commit largely reverts #77, but there were some intervening changes to blending, and this commit also implements the split so some of the stack is in registers. Closes #156	2022-05-16 11:12:33 -07:00
Raph Levien	acb3933d94	Variable size encoding of draw objects This patch switches to a variable size encoding of draw objects. In addition to the CPU-side scene encoding, it changes the representation of intermediate per draw object state from the `Annotated` struct to a variable "info" encoding. In addition, the bounding boxes are moved to a separate array (for a more "structure of "arrays" approach). Data that's unchanged from the scene encoding is not copied. Rather, downstream stages can access the data from the scene buffer (reducing allocation and copying). Prefix sums, computed in `DrawMonoid` track the offset of both scene and intermediate data. The tags for the CPU-side encoding have been split into their own stream (again a change from AoS to SoA style). This is not necessarily the final form. There's some stuff (including at least one piet-gpu-derive type) that can be deleted. In addition, the linewidth field should probably move from the info to path-specific. Also, the 1:1 correspondence between draw object and path has not yet been broken. Closes #152	2022-03-14 16:32:08 -07:00
Raph Levien	3b67a4e7c1	New clip implementation This PR reworks the clip implementation. The highlight is that clip bounding box accounting is now done on GPU rather than CPU. The clip mask is also rasterized on EndClip rather than BeginClip, which decreases memory traffic needed for the clip stack. This is a pretty good working state, but not all cleanup has been applied. An important next step is to remove the CPU clip accounting (it is computed and encoded, but that result is not used). Another step is to remove the Annotated structure entirely. Fixes #88. Also relevant to #119	2022-02-17 17:13:28 -08:00
Raph Levien	d948126c16	Adjust workgroup sizes Make max workgroup size 256 and respect LG_WG_FACTOR. Because the monoid scans only support a height of 2, this will reduce the maximum scene complexity we can render. But it also increases compatibility. Supporting larger scans is a TODO.	2021-12-08 11:48:38 -08:00
Raph Levien	44327fe49f	Beginnings of new element pipeline This successfully renders the tiger; fills and strokes are supported. Other parts of the imaging model, not yet. Progress toward #119	2021-12-03 15:33:01 -08:00
Raph Levien	875c8badf4	Add draw object stage This is one of the stages in the new element pipeline. It's a simple one, just a prefix sum of a couple counts, and some of it will probably get merged with a downstream stage, but we'll do it separately for now for convenience. This patch also contains an update to Vulkan tools 1.2.198, which accounts for the large diff of translated shaders.	2021-12-02 13:37:16 -08:00
Raph Levien	178761dcb3	Path stream processing This patch contains the core of the path stream processing, though some integration bits are missing. The core logic is tested, though combinations of path types, transforms, and line widths are not (yet). Progress towards #119	2021-12-01 07:33:24 -08:00
Raph Levien	47f8812e2f	Start work on new element pipeline There's a bit of reorganizing as well. Shader stages are made available from piet-gpu to the test rig, config is now a proper structure (marshaled with bytemuck). This commit just has the transform stage, which is a simple monoid scan of affine transforms. Progress toward #119	2021-11-24 08:01:43 -08:00
Elias Naur	039cfcf0de	piet-gpu/shader: treat memoryBarrierBuffer as a control barrier memoryBarrierBuffer is mapped to the threadgroup_barrier function in Metal, which is a control barrier that must be executed by all threads (or none). This change establishes that property for the two memory barriers we have. While here, remove ENABLE_IMAGE_INDICES completely; it was disabled in an earlier change. Signed-off-by: Elias Naur <mail@eliasnaur.com>	2021-08-20 20:41:35 +02:00
Raph Levien	6f707c4c62	Start work on gradients WIP. Most of the GPU-side work should be done (though it's not tested end-to-end and it's certainly possible I missed something), but still needs work on encoding side.	2021-07-12 06:56:52 -07:00
Raph Levien	115cb855d9	Query extensions at runtime Don't run extensions unless they're available. This includes querying for descriptor indexing, and running one of two versions of kernel4 depending on whether it's enabled. Part of the support needed for #78	2021-04-08 15:11:15 -07:00
Elias Naur	ee4429a26f	kernel4: separate area from alpha in clip stack This change prepares for kernel4 to output alpha. No functional changes. Signed-off-by: Elias Naur <mail@eliasnaur.com>	2021-03-31 19:51:42 +02:00
Elias Naur	e9ff509ab9	use tag flags for fill vs stroke modes in scene elements Encode stroke vs fill as tag flags, thereby reducing the number of scene elements. Encoding change only, no functional changes. The previous Stroke and Fill commands are merged to one command, FillColor. The encoding to annotated element is divergent, which is fixed when annotated elements move to tag flags. Updates #70 Signed-off-by: Elias Naur <mail@eliasnaur.com>	2021-03-19 12:50:12 +01:00
Elias Naur	903ab1fb59	implement FillImage command and sRGB support FillImage is like Fill, except that it takes its color from one or more image atlases. kernel4 uses a single image for non-Vulkan hosts, and the dynamic sized array of image descriptors on Vulkan. A previous version of this commit used textures. I think images are a better choice for piet-gpu, for several reasons: - Texture sampling, in particular textureGrad, is slow on lower spec devices such as Google Pixel. Texture sampling is particularly slow and difficult to implement for CPU fallbacks. - Texture sampling need more parameters, in particular the full u,v transformation matrix, leading to a large increase in the command size. Since all commands use the same size, that memory penalty is paid by all scenes, not just scenes with textures. - It is unlikely that piet-gpu will support every kind of fill for every client, because each kind must be added to kernel4. With FillImage, a client will prepare the image(s) in separate shader stages, sampling and applying transformations and special effects as needed. Textures that align with the output pixel grid can be used directly, without pre-processing. Note that the pre-processing step can run concurrently with the piet-gpu pipeline; Only the last stage, kernel4, needs the images. Pre-processing most likely uses fixed function vertex/fragment programs, which on some GPUs may run in parallel with piet-gpu's compute programs. While here, fix a few validation errors: - Explicitly enable EXT_descriptor_indexing, KHR_maintenance3, KHR_get_physical_device_properties2. - Specify a vkDescriptorSetVariableDescriptorCountAllocateInfo for vkAllocateDescriptorSets. Otherwise, variable image2D arrays won't work (but sampler2D arrays do, at least on my setup). Updates #38 Signed-off-by: Elias Naur <mail@eliasnaur.com>	2021-03-19 12:50:12 +01:00
Elias Naur	07e07c7544	ensure consistent path segment transformation As described in #62, the non-deterministic scene monoid may result in slightly different transformations for path segments in an otherwise closed path. This change ensures consistent transformation across paths in three steps. First, absolute transformations computed by the scene monoid is stored along with path segments and annotated elements. Second, elements.comp no longer transforms path segments. Instead, each segment is stored untransformed along with a reference to its absolute transformation. Finally, path_coarse performs the transformation of path segments. Because all segments in a path share a single transformation reference, the inconsistency in #62 is avoided. Fixes #62 Signed-off-by: Elias Naur <mail@eliasnaur.com>	2021-03-19 12:45:23 +01:00
Elias Naur	6a4e26ef2a	all: add optional memory checks Defining MEM_DEBUG in mem.h will add a size field to Alloc and enable bounds and alignment checks for every memory read and write. Notes: - Deriving an Alloc from Path.tiles is unsound, but it's more trouble to convert Path.tiles from TileRef to a variable sized Alloc. - elements.comp note that "We should be able to use an array of structs but the NV shader compiler doesn't seem to like it". If that's still relevant, does the shared arrays of Allocs work? Signed-off-by: Elias Naur <mail@eliasnaur.com>	2021-02-15 16:07:45 +01:00
Elias Naur	c4f5a69a0d	implement variable output sizing Signed-off-by: Elias Naur <mail@eliasnaur.com>	2020-12-27 20:24:29 +01:00
Elias Naur	4de67d9081	unify GPU memory management Merge all static and dynamic buffers to just one, "memory". Add a malloc function for dynamic allocations. Unify static allocation offsets into a "config" buffer containing scene setup (number of paths, number of path segments), as well as the memory offsets of the static allocations. Finally, set an overflow flag when an allocation fail, and make sure to exit shader execution as soon as that triggers. Add checks before beginning execution in case the client wants to run two or more shaders before checking the flag. The "state" buffer is left alone because it needs zero'ing and because it is accessed with the "volatile" keyword. Fixes #40 Signed-off-by: Elias Naur <mail@eliasnaur.com>	2020-12-27 20:24:29 +01:00
Elias Naur	d21f2b68de	all: add SPDX license headers Fixes #53 Signed-off-by: Elias Naur <mail@eliasnaur.com>	2020-12-11 18:24:35 +01:00
Elias Naur	ac3ac3ddff	shader: introduce a crude setting for adjusting the maximum workgroup size Both the Vulkan and OpenGL ES spec allow implementations to limit workgroups to 128 threads. Add a LG_WG_FACTOR setting for easy switching between 128 and 256 threads, with 256 being kept as the default setting. Manually tested that LG_WG_FACTOR = 0 (128 threads) works as expected. Signed-off-by: Elias Naur <mail@eliasnaur.com>	2020-09-13 13:04:13 +02:00
Elias Naur	326f7f0d03	shader: delete more unused code and variables Signed-off-by: Elias Naur <mail@eliasnaur.com>	2020-09-13 13:03:56 +02:00
Elias Naur	05636995dd	compute IMAGE_WIDTH and IMAGE_HEIGHT; remove dead code from setup.h Signed-off-by: Elias Naur <mail@eliasnaur.com>	2020-08-29 15:03:40 +02:00
Raph Levien	294f6fd1db	Experiment with new sorting scheme Path segments are unsorted, but other elements are using the same sort-middle approach as before. This is a checkpoint. At this point, there are unoptimized versions of tile init and coarse path raster, but it isn't wired up into a working pipeline. Also observing about a 3x performance regression in element processing, which needs to be investigated.	2020-06-03 09:29:25 -07:00
Raph Levien	a616b4d010	Rework right_edge computation in elements Trying to fit it into the fancy monad doesn't really work, so use a more straightforward approach to compute it from the aggregate. Also add yEdge logic (basically copying piet-metal). With a fix to ELEMENT_BINNING_RATIO (which I had simply gotten wrong), the example renders almost correctly, with small bounding box artifacts.	2020-05-21 10:00:56 -07:00
Raph Levien	03da52cff8	Start implementing fills This should get the "right_edge" value for each segment plumbed through to the binning phase. It also needs to be plumbed to coarse raster and wired up there. Also considering WIP because none of this logic has been tested yet.	2020-05-19 20:40:04 -07:00
Raph Levien	cc89d0e285	Starting coarse rasterizer Working down the pipeline. WIP	2020-05-13 21:39:47 -07:00
Raph Levien	aa83d782ed	Fills Adds fills, and has more or less working tiger render (with artifacts).	2020-05-01 19:42:20 -07:00
Raph Levien	b23fe25177	Use linked list strategy for segments Trying to allocate them contiguously wasn't good.	2020-04-28 22:25:57 -07:00
Raph Levien	cb06b1bc3d	Implement stroked polylines This version seems to work but the allocation of segments has low utilization. Probably best to allocate in chunks rather than try to make them contiguous.	2020-04-28 18:45:59 -07:00
Raph Levien	55e35dd879	Dynamic allocation of intermediate buffers When the initial allocation is exceeded, do an atomic bump allocation. This is done for both tilegroup instances and per tile command lists.	2020-04-25 10:45:47 -07:00
Raph Levien	e1c0e448ef	Encode stroke in scene This just adds the first step of polyline stroking, which is adding it to the scene. Also just a bit of cleaning up of dimensions into one header file.	2020-04-25 08:24:46 -07:00

32 commits