Path stream processing

This patch contains the core of the path stream processing, though some
integration bits are missing. The core logic is tested, but combinations of
path types, transforms, and line widths are not (yet).

Progress towards #119
Raph Levien 2021-11-24 16:26:45 -08:00
parent 5ea5c4bb9a
commit 178761dcb3
35 changed files with 1250 additions and 14 deletions

doc/pathseg.md (new file, 65 lines)

@ -0,0 +1,65 @@
# Path segment encoding
The new (November 2021) element processing pipeline has a particularly clever approach to path segment encoding, and this document explains that.
By way of motivation, in the old scene encoding, all elements take a fixed amount of space, currently 36 bytes, but that's at risk of expanding if a new element type requires even more space. The new design is based on stream compaction. The input is separated into multiple streams, so in particular path segment data gets its own stream. Further, that stream can be packed.
As explained in [#119], the path stream is separated into one stream for tag bytes, and another stream for the path segment data.
## Prefix sum for unpacking
The key to this encoding is a prefix sum over the size of each element's payload. The payload size can be readily derived from the tag byte itself (see below for details on this), then an exclusive prefix sum gives the start offset of the packed encoding for each element. The combination of the tag byte and that offset gives you everything needed to unpack a segment.
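To make this concrete, here is a small CPU-side sketch (illustrative Rust, not part of the patch; the helper names are made up) of recovering per-element offsets from the tag stream with an exclusive prefix sum, using the size rule detailed in the next section:
```rust
/// Payload size of one tag byte, in u32 words (see the tag byte encoding below).
fn payload_words(tag: u8) -> u32 {
    let points = (tag & 3) as u32 + ((tag >> 2) & 1) as u32;
    // 16 bit points occupy one u32 each; 32 bit points occupy two.
    points << (((tag >> 3) & 1) as u32)
}

/// Exclusive prefix sum: start offset (in u32 words) of each element's payload.
fn unpack_offsets(tags: &[u8]) -> Vec<u32> {
    let mut offsets = Vec::with_capacity(tags.len());
    let mut sum = 0u32;
    for &tag in tags {
        offsets.push(sum);
        sum += payload_words(tag);
    }
    offsets
}
```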
## Tag byte encoding
Bits 0-1 indicate the type of path segment: 1 is line, 2 is quadratic bezier, 3 is cubic bezier.
Bit 2 indicates whether this is the last segment in a subpath; see below.
Bit 3 indicates whether the coordinates are i16 or f32.
Thus, values of 1-7 indicate the following combinations. The table below is for the 16 bit encoding, where each point packs into a single u32, so `size` counts both points and u32 words.
```
value op size
1 lineto 1
2 quadto 2
3 curveto 3
5 lineto + end 2
6 quadto + end 3
7 curveto + end 4
```
Values of 9-15 are the same but with a 32 bit encoding, so double `size` to compute the size in u32 units.
A value of 0 indicates no path segment present; it may be a nop, for example padding at the end of the stream to make it an integral number of workgroups, or other bits in the tag byte might indicate a transform, end path, or line width marker (with one bit left for future expansion). Values of 4, 8, and 12 are unused.
In addition to path segments, bits 4-6 are "one hot" encodings of other element types. Bit 4 set (0x10) is a path (encoded after all path segments). Bit 5 set (0x20) is a transform. Bit 6 set (0x40) is a line width setting. Transforms and line widths have their own streams in the encoded scene buffer, so prefix sums of the counts serve as indices into those streams.
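As a sketch of the whole layout (illustrative Rust, not the shader code; the constants mirror the `PATH_TAG_*` defines in `pathtag.h`), a single tag byte decodes as follows:
```rust
const PATH_TAG_PATH: u8 = 0x10;
const PATH_TAG_TRANSFORM: u8 = 0x20;
const PATH_TAG_LINEWIDTH: u8 = 0x40;

struct DecodedTag {
    seg_type: u8,      // 0 = none, 1 = lineto, 2 = quadto, 3 = curveto
    subpath_end: bool, // bit 2: last segment in the subpath
    is_f32: bool,      // bit 3: f32 coordinates rather than i16
    n_points: u32,     // newly encoded points, the `size` column above
    is_path: bool,
    is_transform: bool,
    is_linewidth: bool,
}

fn decode_tag(tag: u8) -> DecodedTag {
    DecodedTag {
        seg_type: tag & 3,
        subpath_end: (tag & 4) != 0,
        is_f32: (tag & 8) != 0,
        n_points: (tag & 3) as u32 + ((tag >> 2) & 1) as u32,
        is_path: (tag & PATH_TAG_PATH) != 0,
        is_transform: (tag & PATH_TAG_TRANSFORM) != 0,
        is_linewidth: (tag & PATH_TAG_LINEWIDTH) != 0,
    }
}
```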
### End subpath handling
In the previous encoding, every path segment was encoded independently; the segments could be shuffled within a path without affecting the results. However, that encoding failed to take advantage of the fact that subpaths are continuous, meaning that the start point of each segment is equal to the end point of the previous segment. Thus, there was redundancy in the encoding, and more CPU-side work for the encoder.
This encoding fixes that. Bit 2 of the tag byte indicates whether the segment is the last one in the subpath. If it is set, then the size encompasses all the points in the segment. If not, then it is short one, which leaves the offset for the next segment pointing at the last point in this one.
There is a relatively straightforward state machine to convert the usual moveto/lineto representation to this one. In short, the point for the moveto is encoded, a moveto or closepath sets the end bit for the previously encoded segment (if any), and the end bit is also set for the last segment in the path. Certain cases, such as a lone moveto, must be avoided.
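For example, using the `PathEncoder` added in this patch (and assuming the tag values described above: 9 = f32 lineto, 13 = f32 lineto with the end bit, 0x10 = path), encoding a closed triangle looks roughly like this:
```rust
use piet_gpu::stages::PathEncoder;

fn main() {
    let mut tags: Vec<u8> = Vec::new();
    let mut pathsegs: Vec<u8> = Vec::new();
    let mut encoder = PathEncoder::new(&mut tags, &mut pathsegs);
    encoder.move_to(0.0, 0.0);
    encoder.line_to(100.0, 0.0);
    encoder.line_to(0.0, 100.0);
    // The subpath does not end at its start point, so close_path emits a
    // closing lineto back to (0, 0) with the end bit set.
    encoder.close_path();
    // The path element is encoded after all of its segments.
    encoder.path();
    assert_eq!(encoder.n_pathseg(), 3);
    assert_eq!(tags, vec![9, 9, 13, 0x10]);
    // Point stream: start point, two vertices, and the repeated start point.
    assert_eq!(pathsegs.len(), 4 * 8);
}
```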
### Bit magic
The encoding is carefully designed for fast calculation based on bits, in particular to quickly compute a sum of counts based on all four tag bytes in a u32.
To detect whether a path segment is present, compute `(tag | (tag >> 1)) & 1`. Thus, the number of path segments in a 4-byte word is `bitCount((tag | (tag >> 1)) & 0x1010101)`. Also note: `((tag & 3) * 7) & 4` is nonzero exactly when a path segment is present, so counting those bits gives the same result and might save one instruction, given that `tag & 3` can be reused below.
The number of points (i.e. the `size` value in the table above) is `(tag & 3) + ((tag >> 2) & 1)`. The value `(tag >> 3) & 1` is 0 for 16 bit encodings and 1 for 32 bit encodings. Thus, `points + (points & (((tag >> 3) & 1) * 7))` is the number of u32 words; the mask doubles the count for 32 bit encodings and leaves it unchanged otherwise (the code below multiplies by 15 instead, which is equivalent since the count is at most 4). All these operations can be performed in parallel on the 4 bytes in a word, justifying the following code:
```glsl
uint point_count = (tag & 0x3030303) + ((tag >> 2) & 0x1010101);
uint word_count = point_count + (point_count & (((tag >> 3) & 0x1010101) * 15));
word_count += word_count >> 8;
word_count += word_count >> 16;
word_count &= 0xff;
```
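For readers who prefer to check this on the CPU, here is a rough Rust equivalent of the same computation (illustrative only; the function names are made up), together with a scalar reference:
```rust
/// Scalar reference: payload words contributed by one tag byte.
fn words_for_tag(tag: u8) -> u32 {
    let points = (tag & 3) as u32 + ((tag >> 2) & 1) as u32;
    points << (((tag >> 3) & 1) as u32)
}

/// SWAR version over the four tag bytes packed into a u32, mirroring the GLSL above.
fn words_for_tag_word(tag: u32) -> u32 {
    let point_count = (tag & 0x0303_0303) + ((tag >> 2) & 0x0101_0101);
    let mut word_count = point_count + (point_count & (((tag >> 3) & 0x0101_0101) * 15));
    word_count += word_count >> 8;
    word_count += word_count >> 16;
    word_count & 0xff
}

/// Sanity check: both versions agree for any packing of four tag bytes.
fn check(tag_word: u32) {
    let scalar: u32 = tag_word.to_le_bytes().iter().map(|&b| words_for_tag(b)).sum();
    assert_eq!(scalar, words_for_tag_word(tag_word));
}
```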
One possible optimization to explore is packing multiple tags into a byte by or'ing together the flags. This would add a small amount of complexity into the interpretation (mostly in pathseg), and increase utilization a bit.
[#119]: https://github.com/linebender/piet-gpu/issues/119


@ -813,8 +813,8 @@ impl Buffer {
) -> Result<BufReadGuard<'a>, Error> {
let offset = match range.start_bound() {
Bound::Unbounded => 0,
Bound::Excluded(&s) => s.try_into()?,
Bound::Included(_) => unreachable!(),
Bound::Excluded(_) => unreachable!(),
Bound::Included(&s) => s.try_into()?,
};
let end = match range.end_bound() {
Bound::Unbounded => self.size(),


@ -0,0 +1,29 @@
// SPDX-License-Identifier: Apache-2.0 OR MIT OR Unlicense
// Clear path bbox to prepare for atomic min/max.
#version 450
#extension GL_GOOGLE_include_directive : enable
#include "mem.h"
#include "setup.h"
#define LG_WG_SIZE 9
#define WG_SIZE (1 << LG_WG_SIZE)
layout(local_size_x = WG_SIZE, local_size_y = 1) in;
layout(binding = 1) readonly buffer ConfigBuf {
Config conf;
};
void main() {
uint ix = gl_GlobalInvocationID.x;
if (ix < conf.n_elements) {
uint out_ix = (conf.bbox_alloc.offset >> 2) + 4 * ix;
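// Four u32s per path: the biased min x/y start at the maximum value 0xffff so
// atomicMin can lower them; the max x/y start at 0 for atomicMax.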
memory[out_ix] = 0xffff;
memory[out_ix + 1] = 0xffff;
memory[out_ix + 2] = 0;
memory[out_ix + 3] = 0;
}
}


@ -57,3 +57,12 @@ build gen/transform_leaf.spv: glsl transform_leaf.comp | scene.h tile.h setup.h
build gen/transform_leaf.hlsl: hlsl gen/transform_leaf.spv
build gen/transform_leaf.dxil: dxil gen/transform_leaf.hlsl
build gen/transform_leaf.msl: msl gen/transform_leaf.spv
build gen/pathtag_reduce.spv: glsl pathtag_reduce.comp | pathtag.h setup.h mem.h
build gen/pathtag_root.spv: glsl pathtag_scan.comp | pathtag.h
flags = -DROOT
build gen/bbox_clear.spv: glsl bbox_clear.comp | setup.h mem.h
build gen/pathseg.spv: glsl pathseg.comp | tile.h pathseg.h pathtag.h setup.h mem.h


@ -37,8 +37,12 @@ struct Config
Alloc pathseg_alloc;
Alloc anno_alloc;
Alloc trans_alloc;
Alloc bbox_alloc;
uint n_trans;
uint trans_offset;
uint pathtag_offset;
uint linewidth_offset;
uint pathseg_offset;
};
static const uint3 gl_WorkGroupSize = uint3(512u, 1u, 1u);
@ -144,7 +148,7 @@ void TransformSeg_write(Alloc a, TransformSegRef ref, TransformSeg s)
void comp_main()
{
uint ix = gl_GlobalInvocationID.x * 8u;
TransformRef _285 = { _278.Load(44) + (ix * 24u) };
TransformRef _285 = { _278.Load(48) + (ix * 24u) };
TransformRef ref = _285;
TransformRef param = ref;
Transform agg = Transform_read(param);


@ -100,8 +100,12 @@ struct Config
Alloc_1 pathseg_alloc;
Alloc_1 anno_alloc;
Alloc_1 trans_alloc;
Alloc_1 bbox_alloc;
uint n_trans;
uint trans_offset;
uint pathtag_offset;
uint linewidth_offset;
uint pathseg_offset;
};
struct ConfigBuf


@ -26,8 +26,12 @@ struct Config
Alloc pathseg_alloc;
Alloc anno_alloc;
Alloc trans_alloc;
Alloc bbox_alloc;
uint n_trans;
uint trans_offset;
uint pathtag_offset;
uint linewidth_offset;
uint pathseg_offset;
};
static const uint3 gl_WorkGroupSize = uint3(512u, 1u, 1u);
@ -81,7 +85,7 @@ Transform combine_monoid(Transform a, Transform b)
void comp_main()
{
uint ix = gl_GlobalInvocationID.x * 8u;
TransformRef _168 = { _161.Load(44) + (ix * 24u) };
TransformRef _168 = { _161.Load(48) + (ix * 24u) };
TransformRef ref = _168;
TransformRef param = ref;
Transform agg = Transform_read(param);


@ -38,8 +38,12 @@ struct Config
Alloc pathseg_alloc;
Alloc anno_alloc;
Alloc trans_alloc;
Alloc bbox_alloc;
uint n_trans;
uint trans_offset;
uint pathtag_offset;
uint linewidth_offset;
uint pathseg_offset;
};
struct ConfigBuf


@ -0,0 +1,284 @@
// SPDX-License-Identifier: Apache-2.0 OR MIT OR Unlicense
// Processing of the path stream, after the tag scan.
#version 450
#extension GL_GOOGLE_include_directive : enable
#include "mem.h"
#include "setup.h"
#include "pathtag.h"
#define N_SEQ 4
#define LG_WG_SIZE 9
#define WG_SIZE (1 << LG_WG_SIZE)
#define PARTITION_SIZE (WG_SIZE * N_SEQ)
layout(local_size_x = WG_SIZE, local_size_y = 1) in;
layout(binding = 1) readonly buffer ConfigBuf {
Config conf;
};
layout(binding = 2) readonly buffer SceneBuf {
uint[] scene;
};
#include "tile.h"
#include "pathseg.h"
layout(binding = 3) readonly buffer ParentBuf {
TagMonoid[] parent;
};
struct Monoid {
vec4 bbox;
uint flags;
};
#define FLAG_RESET_BBOX 1
#define FLAG_SET_BBOX 2
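// FLAG_RESET_BBOX marks a "path" element, which restarts bbox accumulation.
// FLAG_SET_BBOX records that a reset happened earlier in the combined range,
// so the bbox no longer depends on elements further to the left.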
Monoid combine_monoid(Monoid a, Monoid b) {
Monoid c;
c.bbox = b.bbox;
// TODO: I think this should be gated on b & SET_BBOX == false also.
if ((a.flags & FLAG_RESET_BBOX) == 0 && b.bbox.z <= b.bbox.x && b.bbox.w <= b.bbox.y) {
c.bbox = a.bbox;
} else if ((a.flags & FLAG_RESET_BBOX) == 0 && (b.flags & FLAG_SET_BBOX) == 0 &&
(a.bbox.z > a.bbox.x || a.bbox.w > a.bbox.y))
{
c.bbox.xy = min(a.bbox.xy, c.bbox.xy);
c.bbox.zw = max(a.bbox.zw, c.bbox.zw);
}
c.flags = (a.flags & FLAG_SET_BBOX) | b.flags;
c.flags |= ((a.flags & FLAG_RESET_BBOX) << 1);
return c;
}
Monoid monoid_identity() {
return Monoid(vec4(0.0, 0.0, 0.0, 0.0), 0);
}
// These are not both live at the same time. A very smart shader compiler
// would be able to figure that out, but I suspect many won't.
shared TagMonoid sh_tag[WG_SIZE];
shared Monoid sh_scratch[WG_SIZE];
vec2 read_f32_point(uint ix) {
float x = uintBitsToFloat(scene[ix]);
float y = uintBitsToFloat(scene[ix + 1]);
return vec2(x, y);
}
vec2 read_i16_point(uint ix) {
uint raw = scene[ix];
float x = float(int(raw << 16) >> 16);
float y = float(int(raw) >> 16);
return vec2(x, y);
}
// Note: these are 16 bit, which is adequate, but we could use 32 bits.
// Round down and saturate to minimum integer; add bias
uint round_down(float x) {
return uint(max(0.0, floor(x) + 32768.0));
}
// Round up and saturate to maximum integer; add bias
uint round_up(float x) {
return uint(min(65535.0, ceil(x) + 32768.0));
}
void main() {
Monoid local[N_SEQ];
uint ix = gl_GlobalInvocationID.x * N_SEQ;
uint tag_word = scene[(conf.pathtag_offset >> 2) + (ix >> 2)];
// Scan the tag monoid
TagMonoid local_tm = reduce_tag(tag_word);
sh_tag[gl_LocalInvocationID.x] = local_tm;
for (uint i = 0; i < LG_WG_SIZE; i++) {
barrier();
if (gl_LocalInvocationID.x >= (1u << i)) {
TagMonoid other = sh_tag[gl_LocalInvocationID.x - (1u << i)];
local_tm = combine_tag_monoid(other, local_tm);
}
barrier();
sh_tag[gl_LocalInvocationID.x] = local_tm;
}
barrier();
// sh_tag is now the partition-wide inclusive scan of the tag monoid.
TagMonoid tm = tag_monoid_identity();
if (gl_WorkGroupID.x > 0) {
tm = parent[gl_WorkGroupID.x - 1];
}
if (gl_LocalInvocationID.x > 0) {
tm = combine_tag_monoid(tm, sh_tag[gl_LocalInvocationID.x - 1]);
}
// tm is now the full exclusive scan of the tag monoid.
// Indices to scene buffer in u32 units.
uint ps_ix = (conf.pathseg_offset >> 2) + tm.pathseg_offset;
uint lw_ix = (conf.linewidth_offset >> 2) + tm.linewidth_ix;
uint save_path_ix = tm.path_ix;
TransformSegRef trans_ref = TransformSegRef(conf.trans_alloc.offset + tm.trans_ix * TransformSeg_size);
PathSegRef ps_ref = PathSegRef(conf.pathseg_alloc.offset + tm.pathseg_ix * PathSeg_size);
for (uint i = 0; i < N_SEQ; i++) {
// if N_SEQ > 4, need to load tag_word from local if N_SEQ % 4 == 0
uint tag_byte = tag_word >> (i * 8);
uint seg_type = tag_byte & 3;
if (seg_type != 0) {
// 1 = line, 2 = quad, 3 = cubic
// Unpack path segment from input
vec2 p0;
vec2 p1;
vec2 p2;
vec2 p3;
if ((tag_byte & 8) != 0) {
// 32 bit encoding
p0 = read_f32_point(ps_ix);
p1 = read_f32_point(ps_ix + 2);
if (seg_type >= 2) {
p2 = read_f32_point(ps_ix + 4);
if (seg_type == 3) {
p3 = read_f32_point(ps_ix + 6);
}
}
} else {
// 16 bit encoding
p0 = read_i16_point(ps_ix);
p1 = read_i16_point(ps_ix + 1);
if (seg_type >= 2) {
p2 = read_i16_point(ps_ix + 2);
if (seg_type == 3) {
p3 = read_i16_point(ps_ix + 3);
}
}
}
float linewidth = uintBitsToFloat(scene[lw_ix]);
TransformSeg transform = TransformSeg_read(conf.trans_alloc, trans_ref);
p0 = transform.mat.xy * p0.x + transform.mat.zw * p0.y + transform.translate;
p1 = transform.mat.xy * p1.x + transform.mat.zw * p1.y + transform.translate;
vec4 bbox = vec4(min(p0, p1), max(p0, p1));
// Degree-raise and compute bbox
if (seg_type >= 2) {
p2 = transform.mat.xy * p2.x + transform.mat.zw * p2.y + transform.translate;
bbox.xy = min(bbox.xy, p2);
bbox.zw = max(bbox.zw, p2);
if (seg_type == 3) {
p3 = transform.mat.xy * p3.x + transform.mat.zw * p3.y + transform.translate;
bbox.xy = min(bbox.xy, p3);
bbox.zw = max(bbox.zw, p3);
} else {
p3 = p2;
p2 = mix(p1, p2, 1.0 / 3.0);
p1 = mix(p1, p0, 1.0 / 3.0);
}
} else {
p3 = p1;
p2 = mix(p3, p0, 1.0 / 3.0);
p1 = mix(p0, p3, 1.0 / 3.0);
}
vec2 stroke = vec2(0.0, 0.0);
if (linewidth >= 0.0) {
// See https://www.iquilezles.org/www/articles/ellipses/ellipses.htm
stroke = 0.5 * linewidth * vec2(length(transform.mat.xz), length(transform.mat.yw));
bbox += vec4(-stroke, stroke);
}
local[i].bbox = bbox;
local[i].flags = 0;
// Write path segment to output
PathCubic cubic;
cubic.p0 = p0;
cubic.p1 = p1;
cubic.p2 = p2;
cubic.p3 = p3;
cubic.path_ix = tm.path_ix;
// Not needed, TODO remove from struct
cubic.trans_ix = gl_GlobalInvocationID.x * 4 + i;
cubic.stroke = stroke;
uint fill_mode = uint(linewidth >= 0.0);
PathSeg_Cubic_write(conf.pathseg_alloc, ps_ref, fill_mode, cubic);
ps_ref.offset += PathSeg_size;
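// Advance the scene offset past the points consumed by this segment,
// using the same size rule as doc/pathseg.md.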
uint n_points = (tag_byte & 3) + ((tag_byte >> 2) & 1);
uint n_words = n_points + (n_points & (((tag_byte >> 3) & 1) * 15));
ps_ix += n_words;
} else {
local[i].bbox = vec4(0.0, 0.0, 0.0, 0.0);
// These shifts need to be kept in sync with setup.h
uint is_path = (tag_byte >> 4) & 1;
// Relies on the fact that RESET_BBOX == 1
local[i].flags = is_path;
tm.path_ix += is_path;
trans_ref.offset += ((tag_byte >> 5) & 1) * TransformSeg_size;
lw_ix += (tag_byte >> 6) & 1;
}
}
// Partition-wide monoid scan for bbox monoid
Monoid agg = local[0];
for (uint i = 1; i < N_SEQ; i++) {
// Note: this could be fused with the map above, but probably
// a thin performance gain not worth the complexity.
agg = combine_monoid(agg, local[i]);
local[i] = agg;
}
// local is N_SEQ sub-partition inclusive scan of bbox monoid.
sh_scratch[gl_LocalInvocationID.x] = agg;
for (uint i = 0; i < LG_WG_SIZE; i++) {
barrier();
if (gl_LocalInvocationID.x >= (1u << i)) {
Monoid other = sh_scratch[gl_LocalInvocationID.x - (1u << i)];
agg = combine_monoid(other, agg);
}
barrier();
sh_scratch[gl_LocalInvocationID.x] = agg;
}
// sh_scratch is the partition-wide inclusive scan of the bbox monoid,
// sampled at the end of the N_SEQ sub-partition.
barrier();
uint path_ix = save_path_ix;
uint bbox_out_ix = (conf.bbox_alloc.offset >> 2) + path_ix * 4;
// Write bboxes to paths; do atomic min/max if partial
Monoid row = monoid_identity();
if (gl_LocalInvocationID.x > 0) {
row = sh_scratch[gl_LocalInvocationID.x - 1];
}
for (uint i = 0; i < N_SEQ; i++) {
Monoid m = combine_monoid(row, local[i]);
// m is partition-wide inclusive scan of bbox monoid.
bool do_atomic = false;
if (i == N_SEQ - 1 && gl_LocalInvocationID.x == WG_SIZE - 1) {
// last element
do_atomic = true;
}
if ((m.flags & FLAG_RESET_BBOX) != 0) {
if ((m.flags & FLAG_SET_BBOX) == 0) {
do_atomic = true;
} else {
memory[bbox_out_ix] = round_down(m.bbox.x);
memory[bbox_out_ix + 1] = round_down(m.bbox.y);
memory[bbox_out_ix + 2] = round_up(m.bbox.z);
memory[bbox_out_ix + 3] = round_up(m.bbox.w);
bbox_out_ix += 4;
do_atomic = false;
}
}
if (do_atomic) {
if (m.bbox.z > m.bbox.x || m.bbox.w > m.bbox.y) {
// atomic min/max
atomicMin(memory[bbox_out_ix], round_down(m.bbox.x));
atomicMin(memory[bbox_out_ix + 1], round_down(m.bbox.y));
atomicMax(memory[bbox_out_ix + 2], round_up(m.bbox.z));
atomicMax(memory[bbox_out_ix + 3], round_up(m.bbox.w));
}
bbox_out_ix += 4;
}
}
}

piet-gpu/shader/pathtag.h (new file, 49 lines)

@ -0,0 +1,49 @@
// SPDX-License-Identifier: Apache-2.0 OR MIT OR Unlicense
// Common data structures and functions for the path tag stream.
// This is the layout for tag bytes in the path stream. See
// doc/pathseg.md for an explanation.
#define PATH_TAG_PATHSEG_BITS 0xf
#define PATH_TAG_PATH 0x10
#define PATH_TAG_TRANSFORM 0x20
#define PATH_TAG_LINEWIDTH 0x40
struct TagMonoid {
uint trans_ix;
uint linewidth_ix;
uint pathseg_ix;
uint path_ix;
uint pathseg_offset;
};
TagMonoid tag_monoid_identity() {
return TagMonoid(0, 0, 0, 0, 0);
}
TagMonoid combine_tag_monoid(TagMonoid a, TagMonoid b) {
TagMonoid c;
c.trans_ix = a.trans_ix + b.trans_ix;
c.linewidth_ix = a.linewidth_ix + b.linewidth_ix;
c.pathseg_ix = a.pathseg_ix + b.pathseg_ix;
c.path_ix = a.path_ix + b.path_ix;
c.pathseg_offset = a.pathseg_offset + b.pathseg_offset;
return c;
}
TagMonoid reduce_tag(uint tag_word) {
TagMonoid c;
// Some fun bit magic here, see doc/pathseg.md for explanation.
uint point_count = tag_word & 0x3030303;
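// The low two bits of a tag byte are nonzero iff the byte is a path segment;
// (point_count * 7) & 0x4040404 turns that into one bit per byte for bitCount.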
c.pathseg_ix = bitCount((point_count * 7) & 0x4040404);
c.linewidth_ix = bitCount(tag_word & (PATH_TAG_LINEWIDTH * 0x1010101));
c.path_ix = bitCount(tag_word & (PATH_TAG_PATH * 0x1010101));
c.trans_ix = bitCount(tag_word & (PATH_TAG_TRANSFORM * 0x1010101));
uint n_points = point_count + ((tag_word >> 2) & 0x1010101);
uint a = n_points + (n_points & (((tag_word >> 3) & 0x1010101) * 15));
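// Horizontal sum of the four per-byte word counts.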
a += a >> 8;
a += a >> 16;
c.pathseg_offset = a & 0xff;
return c;
}


@ -0,0 +1,61 @@
// SPDX-License-Identifier: Apache-2.0 OR MIT OR Unlicense
// The reduction phase for path tag scan implemented as a tree reduction.
#version 450
#extension GL_GOOGLE_include_directive : enable
#include "mem.h"
#include "setup.h"
#include "pathtag.h"
// Note: the partition size is smaller than pathseg by a factor
// of 4, as there are 4 tag bytes to a tag word.
#define N_ROWS 4
#define LG_WG_SIZE 7
#define WG_SIZE (1 << LG_WG_SIZE)
#define PARTITION_SIZE (WG_SIZE * N_ROWS)
layout(local_size_x = WG_SIZE, local_size_y = 1) in;
layout(binding = 1) readonly buffer ConfigBuf {
Config conf;
};
layout(binding = 2) readonly buffer SceneBuf {
uint[] scene;
};
#define Monoid TagMonoid
layout(set = 0, binding = 3) buffer OutBuf {
Monoid[] outbuf;
};
shared Monoid sh_scratch[WG_SIZE];
void main() {
uint ix = gl_GlobalInvocationID.x * N_ROWS;
uint scene_ix = (conf.pathtag_offset >> 2) + ix;
uint tag_word = scene[scene_ix];
Monoid agg = reduce_tag(tag_word);
for (uint i = 1; i < N_ROWS; i++) {
tag_word = scene[scene_ix + i];
agg = combine_tag_monoid(agg, reduce_tag(tag_word));
}
sh_scratch[gl_LocalInvocationID.x] = agg;
for (uint i = 0; i < LG_WG_SIZE; i++) {
barrier();
// We could make this predicate tighter, but would it help?
if (gl_LocalInvocationID.x + (1u << i) < WG_SIZE) {
Monoid other = sh_scratch[gl_LocalInvocationID.x + (1u << i)];
agg = combine_tag_monoid(agg, other);
}
barrier();
sh_scratch[gl_LocalInvocationID.x] = agg;
}
if (gl_LocalInvocationID.x == 0) {
outbuf[gl_WorkGroupID.x] = agg;
}
}


@ -0,0 +1,74 @@
// SPDX-License-Identifier: Apache-2.0 OR MIT OR Unlicense
// A scan for path tag scan implemented as a tree reduction.
#version 450
#extension GL_GOOGLE_include_directive : enable
#include "pathtag.h"
#define N_ROWS 8
#define LG_WG_SIZE 9
#define WG_SIZE (1 << LG_WG_SIZE)
#define PARTITION_SIZE (WG_SIZE * N_ROWS)
layout(local_size_x = WG_SIZE, local_size_y = 1) in;
#define Monoid TagMonoid
#define combine_monoid combine_tag_monoid
#define monoid_identity tag_monoid_identity
layout(binding = 0) buffer DataBuf {
Monoid[] data;
};
#ifndef ROOT
layout(binding = 1) readonly buffer ParentBuf {
Monoid[] parent;
};
#endif
shared Monoid sh_scratch[WG_SIZE];
void main() {
Monoid local[N_ROWS];
uint ix = gl_GlobalInvocationID.x * N_ROWS;
local[0] = data[ix];
for (uint i = 1; i < N_ROWS; i++) {
local[i] = combine_monoid(local[i - 1], data[ix + i]);
}
Monoid agg = local[N_ROWS - 1];
sh_scratch[gl_LocalInvocationID.x] = agg;
for (uint i = 0; i < LG_WG_SIZE; i++) {
barrier();
if (gl_LocalInvocationID.x >= (1u << i)) {
Monoid other = sh_scratch[gl_LocalInvocationID.x - (1u << i)];
agg = combine_monoid(other, agg);
}
barrier();
sh_scratch[gl_LocalInvocationID.x] = agg;
}
barrier();
// This could be a semigroup instead of a monoid if we reworked the
// conditional logic, but that might impact performance.
Monoid row = monoid_identity();
#ifdef ROOT
if (gl_LocalInvocationID.x > 0) {
row = sh_scratch[gl_LocalInvocationID.x - 1];
}
#else
if (gl_WorkGroupID.x > 0) {
row = parent[gl_WorkGroupID.x - 1];
}
if (gl_LocalInvocationID.x > 0) {
row = combine_monoid(row, sh_scratch[gl_LocalInvocationID.x - 1]);
}
#endif
for (uint i = 0; i < N_ROWS; i++) {
Monoid m = combine_monoid(row, local[i]);
data[ix + i] = m;
}
}


@ -40,10 +40,20 @@ struct Config {
Alloc trans_alloc;
// new element pipeline stuff follows
// Bounding boxes of paths, stored as int (so atomics work)
Alloc bbox_alloc;
// Number of transforms in scene
// This is probably not needed.
uint n_trans;
// Offset (in bytes) of transform stream in scene buffer
uint trans_offset;
// Offset (in bytes) of path tag stream in scene
uint pathtag_offset;
// Offset (in bytes) of linewidth stream in scene
uint linewidth_offset;
// Offset (in bytes) of path segment stream in scene
uint pathseg_offset;
};
// Fill modes.


@ -48,7 +48,6 @@ void main() {
uint ix = gl_GlobalInvocationID.x * N_ROWS;
// TODO: gate buffer read
local[0] = data[ix];
for (uint i = 1; i < N_ROWS; i++) {
local[i] = combine_monoid(local[i - 1], data[ix + i]);


@ -298,7 +298,6 @@ impl Renderer {
alloc += (n_paths * ANNO_SIZE + 3) & !3;
let trans_base = alloc;
alloc += (n_trans * TRANS_SIZE + 3) & !3;
let trans_offset = 0; // For new element pipeline, not yet used
let config = Config {
n_elements: n_paths as u32,
n_pathseg: n_pathseg as u32,
@ -311,7 +310,8 @@ impl Renderer {
anno_alloc: anno_base as u32,
trans_alloc: trans_base as u32,
n_trans: n_trans as u32,
trans_offset: trans_offset as u32,
// We'll fill the rest of the fields in when we hook up the new element pipeline.
..Default::default()
};
unsafe {
let scene = render_ctx.get_scene_buf();


@ -16,6 +16,8 @@
//! Stages for new element pipeline, exposed for testing.
mod path;
use bytemuck::{Pod, Zeroable};
use piet::kurbo::Affine;
@ -23,6 +25,8 @@ use piet_gpu_hal::{
include_shader, BindType, Buffer, BufferUsage, CmdBuf, DescriptorSet, Pipeline, Session,
};
pub use path::{PathBinding, PathCode, PathEncoder, PathStage};
/// The configuration block passed to piet-gpu shaders.
///
/// Note: this should be kept in sync with the version in setup.h.
@ -39,8 +43,12 @@ pub struct Config {
pub pathseg_alloc: u32,
pub anno_alloc: u32,
pub trans_alloc: u32,
pub bbox_alloc: u32,
pub n_trans: u32,
pub trans_offset: u32,
pub pathtag_offset: u32,
pub linewidth_offset: u32,
pub pathseg_offset: u32,
}
// The individual stages will probably be separate files but for now, all in one.

piet-gpu/src/stages/path.rs (new file, 339 lines)

@ -0,0 +1,339 @@
// Copyright 2021 The piet-gpu authors.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// https://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//
// Also licensed under MIT license, at your choice.
//! The path stage (includes substages).
use piet_gpu_hal::{
BindType, Buffer, BufferUsage, CmdBuf, DescriptorSet, Pipeline, Session, ShaderCode,
};
pub struct PathCode {
reduce_pipeline: Pipeline,
tag_root_pipeline: Pipeline,
clear_pipeline: Pipeline,
pathseg_pipeline: Pipeline,
}
pub struct PathStage {
tag_root_buf: Buffer,
tag_root_ds: DescriptorSet,
}
pub struct PathBinding {
reduce_ds: DescriptorSet,
clear_ds: DescriptorSet,
path_ds: DescriptorSet,
}
const REDUCE_WG: u32 = 128;
const REDUCE_N_ROWS: u32 = 4;
const REDUCE_PART_SIZE: u32 = REDUCE_WG * REDUCE_N_ROWS;
const ROOT_WG: u32 = 512;
const ROOT_N_ROWS: u32 = 8;
const ROOT_PART_SIZE: u32 = ROOT_WG * ROOT_N_ROWS;
const SCAN_WG: u32 = 512;
const SCAN_N_ROWS: u32 = 4;
const SCAN_PART_SIZE: u32 = SCAN_WG * SCAN_N_ROWS;
const CLEAR_WG: u32 = 512;
impl PathCode {
pub unsafe fn new(session: &Session) -> PathCode {
// TODO: add cross-compilation
let reduce_code = ShaderCode::Spv(include_bytes!("../../shader/gen/pathtag_reduce.spv"));
let reduce_pipeline = session
.create_compute_pipeline(
reduce_code,
&[
BindType::Buffer,
BindType::BufReadOnly,
BindType::BufReadOnly,
BindType::Buffer,
],
)
.unwrap();
let tag_root_code = ShaderCode::Spv(include_bytes!("../../shader/gen/pathtag_root.spv"));
let tag_root_pipeline = session
.create_compute_pipeline(tag_root_code, &[BindType::Buffer])
.unwrap();
let clear_code = ShaderCode::Spv(include_bytes!("../../shader/gen/bbox_clear.spv"));
let clear_pipeline = session
.create_compute_pipeline(clear_code, &[BindType::Buffer, BindType::BufReadOnly])
.unwrap();
let pathseg_code = ShaderCode::Spv(include_bytes!("../../shader/gen/pathseg.spv"));
let pathseg_pipeline = session
.create_compute_pipeline(
pathseg_code,
&[
BindType::Buffer,
BindType::BufReadOnly,
BindType::BufReadOnly,
BindType::BufReadOnly,
],
)
.unwrap();
PathCode {
reduce_pipeline,
tag_root_pipeline,
clear_pipeline,
pathseg_pipeline,
}
}
}
impl PathStage {
pub unsafe fn new(session: &Session, code: &PathCode) -> PathStage {
let tag_root_buf_size = (ROOT_PART_SIZE * 20) as u64;
let tag_root_buf = session
.create_buffer(tag_root_buf_size, BufferUsage::STORAGE)
.unwrap();
let tag_root_ds = session
.create_simple_descriptor_set(&code.tag_root_pipeline, &[&tag_root_buf])
.unwrap();
PathStage {
tag_root_buf,
tag_root_ds,
}
}
pub unsafe fn bind(
&self,
session: &Session,
code: &PathCode,
config_buf: &Buffer,
scene_buf: &Buffer,
memory_buf: &Buffer,
) -> PathBinding {
let reduce_ds = session
.create_simple_descriptor_set(
&code.reduce_pipeline,
&[memory_buf, config_buf, scene_buf, &self.tag_root_buf],
)
.unwrap();
let clear_ds = session
.create_simple_descriptor_set(&code.clear_pipeline, &[memory_buf, config_buf])
.unwrap();
let path_ds = session
.create_simple_descriptor_set(
&code.pathseg_pipeline,
&[memory_buf, config_buf, scene_buf, &self.tag_root_buf],
)
.unwrap();
PathBinding {
reduce_ds,
clear_ds,
path_ds,
}
}
/// Record the path stage.
///
/// Note: no barrier is needed for transform output, we have a barrier before
/// those are consumed. Result is written without barrier.
pub unsafe fn record(
&self,
cmd_buf: &mut CmdBuf,
code: &PathCode,
binding: &PathBinding,
n_paths: u32,
n_tags: u32,
) {
if n_tags > ROOT_PART_SIZE * SCAN_PART_SIZE {
println!(
"number of pathsegs exceeded {} > {}",
n_tags,
ROOT_PART_SIZE * SCAN_PART_SIZE
);
}
// Number of tags consumed in a tag reduce workgroup
let reduce_part_tags = REDUCE_PART_SIZE * 4;
let n_wg_tag_reduce = (n_tags + reduce_part_tags - 1) / reduce_part_tags;
if n_wg_tag_reduce > 1 {
cmd_buf.dispatch(
&code.reduce_pipeline,
&binding.reduce_ds,
(n_wg_tag_reduce, 1, 1),
(REDUCE_WG, 1, 1),
);
// I think we can skip root if n_wg_tag_reduce == 2
cmd_buf.memory_barrier();
cmd_buf.dispatch(
&code.tag_root_pipeline,
&self.tag_root_ds,
(1, 1, 1),
(ROOT_WG, 1, 1),
);
// No barrier needed here; clear doesn't depend on path tags
}
let n_wg_clear = (n_paths + CLEAR_WG - 1) / CLEAR_WG;
cmd_buf.dispatch(
&code.clear_pipeline,
&binding.clear_ds,
(n_wg_clear, 1, 1),
(CLEAR_WG, 1, 1),
);
cmd_buf.memory_barrier();
let n_wg_pathseg = (n_tags + SCAN_PART_SIZE - 1) / SCAN_PART_SIZE;
cmd_buf.dispatch(
&code.pathseg_pipeline,
&binding.path_ds,
(n_wg_pathseg, 1, 1),
(SCAN_WG, 1, 1),
);
}
}
pub struct PathEncoder<'a> {
tag_stream: &'a mut Vec<u8>,
// If we're never going to use the i16 encoding, it might be
// slightly faster to store this as Vec<u32>, we'd get aligned
// stores on ARM etc.
pathseg_stream: &'a mut Vec<u8>,
first_pt: [f32; 2],
state: State,
n_pathseg: u32,
}
#[derive(PartialEq)]
enum State {
Start,
MoveTo,
NonemptySubpath,
}
impl<'a> PathEncoder<'a> {
pub fn new(tags: &'a mut Vec<u8>, pathsegs: &'a mut Vec<u8>) -> PathEncoder<'a> {
PathEncoder {
tag_stream: tags,
pathseg_stream: pathsegs,
first_pt: [0.0, 0.0],
state: State::Start,
n_pathseg: 0,
}
}
pub fn move_to(&mut self, x: f32, y: f32) {
let buf = [x, y];
let bytes = bytemuck::bytes_of(&buf);
self.first_pt = buf;
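// If the previous element was also a moveto, it started an empty subpath;
// drop its encoded point so a lone moveto is never emitted.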
if self.state == State::MoveTo {
let new_len = self.pathseg_stream.len() - 8;
self.pathseg_stream.truncate(new_len);
}
if self.state == State::NonemptySubpath {
if let Some(tag) = self.tag_stream.last_mut() {
*tag |= 4;
}
}
self.pathseg_stream.extend_from_slice(bytes);
self.state = State::MoveTo;
}
pub fn line_to(&mut self, x: f32, y: f32) {
if self.state == State::Start {
// should warn or error
return;
}
let buf = [x, y];
let bytes = bytemuck::bytes_of(&buf);
self.pathseg_stream.extend_from_slice(bytes);
self.tag_stream.push(9);
self.state = State::NonemptySubpath;
self.n_pathseg += 1;
}
pub fn quad_to(&mut self, x0: f32, y0: f32, x1: f32, y1: f32) {
if self.state == State::Start {
return;
}
let buf = [x0, y0, x1, y1];
let bytes = bytemuck::bytes_of(&buf);
self.pathseg_stream.extend_from_slice(bytes);
self.tag_stream.push(10);
self.state = State::NonemptySubpath;
self.n_pathseg += 1;
}
pub fn cubic_to(&mut self, x0: f32, y0: f32, x1: f32, y1: f32, x2: f32, y2: f32) {
if self.state == State::Start {
return;
}
let buf = [x0, y0, x1, y1, x2, y2];
let bytes = bytemuck::bytes_of(&buf);
self.pathseg_stream.extend_from_slice(bytes);
self.tag_stream.push(11);
self.state = State::NonemptySubpath;
self.n_pathseg += 1;
}
pub fn close_path(&mut self) {
match self.state {
State::Start => return,
State::MoveTo => {
let new_len = self.pathseg_stream.len() - 8;
self.pathseg_stream.truncate(new_len);
return;
}
State::NonemptySubpath => (),
}
let len = self.pathseg_stream.len();
if len < 8 {
// can't happen
return;
}
let first_bytes = bytemuck::bytes_of(&self.first_pt);
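// If the subpath does not already end at its start point, emit the closing
// line segment with the end bit set (tag 13); otherwise just mark the last
// encoded segment as the end of the subpath.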
if &self.pathseg_stream[len - 8..len] != first_bytes {
self.pathseg_stream.extend_from_slice(first_bytes);
self.tag_stream.push(13);
self.n_pathseg += 1;
} else {
if let Some(tag) = self.tag_stream.last_mut() {
*tag |= 4;
}
}
self.state = State::Start;
}
fn finish(&mut self) {
if self.state == State::MoveTo {
let new_len = self.pathseg_stream.len() - 8;
self.pathseg_stream.truncate(new_len);
}
if let Some(tag) = self.tag_stream.last_mut() {
*tag |= 4;
}
}
/// Finish encoding a path.
///
/// Encode this after encoding path segments.
pub fn path(&mut self) {
self.finish();
// maybe don't encode if path is empty? might throw off sync though
self.tag_stream.push(0x10);
}
/// Get the number of path segments.
///
/// This is the number of path segments that will be written by the
/// path stage; use this for allocating the output buffer.
pub fn n_pathseg(&self) -> u32 {
self.n_pathseg
}
}


@ -25,6 +25,8 @@ mod prefix_tree;
mod runner;
mod test_result;
#[cfg(feature = "piet-gpu")]
mod path;
#[cfg(feature = "piet-gpu")]
mod transform;
@ -134,6 +136,7 @@ fn main() {
#[cfg(feature = "piet-gpu")]
if config.groups.matches("piet") {
report(&transform::transform_test(&mut runner, &config));
report(&path::path_test(&mut runner, &config));
}
}
}

tests/src/path.rs (new file, 293 lines)

@ -0,0 +1,293 @@
// Copyright 2021 The piet-gpu authors.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// https://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//
// Also licensed under MIT license, at your choice.
//! Tests for the piet-gpu path stage.
use crate::{Config, Runner, TestResult};
use bytemuck::{Pod, Zeroable};
use piet_gpu::stages::{self, PathCode, PathEncoder, PathStage};
use piet_gpu_hal::{BufWrite, BufferUsage};
use rand::{prelude::ThreadRng, Rng};
struct PathData {
n_trans: u32,
n_linewidth: u32,
n_path: u32,
n_pathseg: u32,
tags: Vec<u8>,
pathsegs: Vec<u8>,
bbox: Vec<(f32, f32, f32, f32)>,
lines: Vec<([f32; 2], [f32; 2])>,
}
// This is designed to match pathseg.h
#[repr(C)]
#[derive(Clone, Copy, Debug, Default, Zeroable, Pod)]
struct PathSeg {
tag: u32,
p0: [f32; 2],
p1: [f32; 2],
p2: [f32; 2],
p3: [f32; 2],
path_ix: u32,
trans_ix: u32,
stroke: [f32; 2],
}
#[repr(C)]
#[derive(Clone, Copy, Debug, Default, PartialEq, Zeroable, Pod)]
struct Bbox {
left: u32,
top: u32,
right: u32,
bottom: u32,
}
pub unsafe fn path_test(runner: &mut Runner, config: &Config) -> TestResult {
let mut result = TestResult::new("path");
let n_path: u64 = config.size.choose(1 << 12, 1 << 16, 1 << 18);
let path_data = PathData::new(n_path as u32);
let stage_config = path_data.get_config();
let config_buf = runner
.session
.create_buffer_init(std::slice::from_ref(&stage_config), BufferUsage::STORAGE)
.unwrap();
let scene_size = n_path * 256;
let scene_buf = runner
.session
.create_buffer_with(
scene_size,
|b| path_data.fill_scene(b),
BufferUsage::STORAGE,
)
.unwrap();
let memory_init = runner
.session
.create_buffer_with(
path_data.memory_init_size(),
|b| path_data.fill_memory(b),
BufferUsage::COPY_SRC,
)
.unwrap();
let memory = runner.buf_down(path_data.memory_full_size(), BufferUsage::empty());
let code = PathCode::new(&runner.session);
let stage = PathStage::new(&runner.session, &code);
let binding = stage.bind(
&runner.session,
&code,
&config_buf,
&scene_buf,
&memory.dev_buf,
);
let mut total_elapsed = 0.0;
let n_iter = config.n_iter;
for i in 0..n_iter {
let mut commands = runner.commands();
commands.cmd_buf.copy_buffer(&memory_init, &memory.dev_buf);
commands.cmd_buf.memory_barrier();
commands.write_timestamp(0);
stage.record(
&mut commands.cmd_buf,
&code,
&binding,
path_data.n_path,
path_data.tags.len() as u32,
);
commands.write_timestamp(1);
if i == 0 || config.verify_all {
commands.cmd_buf.memory_barrier();
commands.download(&memory);
}
total_elapsed += runner.submit(commands);
if i == 0 || config.verify_all {
let dst = memory.map_read(..);
if let Some(failure) = path_data.verify(&dst) {
result.fail(failure);
}
}
}
let n_elements = path_data.n_pathseg as u64;
result.timing(total_elapsed, n_elements * n_iter);
result
}
fn rand_point(rng: &mut ThreadRng) -> (f32, f32) {
let x = rng.gen_range(0.0, 100.0);
let y = rng.gen_range(0.0, 100.0);
(x, y)
}
// Must match shader/pathseg.h
const PATHSEG_SIZE: u32 = 52;
impl PathData {
fn new(n_path: u32) -> PathData {
let mut rng = rand::thread_rng();
let n_trans = 1;
let n_linewidth = 1;
let segments_per_path = 8;
let mut tags = Vec::new();
let mut pathsegs = Vec::new();
let mut bbox = Vec::new();
let mut lines = Vec::new();
let mut encoder = PathEncoder::new(&mut tags, &mut pathsegs);
for _ in 0..n_path {
let (x, y) = rand_point(&mut rng);
let mut min_x = x;
let mut max_x = x;
let mut min_y = y;
let mut max_y = y;
let first_pt = [x, y];
let mut last_pt = [x, y];
encoder.move_to(x, y);
for _ in 0..segments_per_path {
let (x, y) = rand_point(&mut rng);
lines.push((last_pt, [x, y]));
last_pt = [x, y];
encoder.line_to(x, y);
min_x = min_x.min(x);
max_x = max_x.max(x);
min_y = min_y.min(y);
max_y = max_y.max(y);
}
bbox.push((min_x, min_y, max_x, max_y));
encoder.close_path();
// With very low probability last_pt and first_pt might be equal, which
// would cause a test failure - might want to seed RNG.
lines.push((last_pt, first_pt));
encoder.path();
}
let n_pathseg = encoder.n_pathseg();
//println!("tags: {:x?}", &tags[0..8]);
//println!("path: {:?}", bytemuck::cast_slice::<u8, f32>(&pathsegs[0..64]));
PathData {
n_trans,
n_linewidth,
n_path,
n_pathseg,
tags,
pathsegs,
bbox,
lines,
}
}
fn get_config(&self) -> stages::Config {
let n_trans = self.n_trans;
// Layout of scene buffer
let linewidth_offset = 0;
let pathtag_offset = linewidth_offset + self.n_linewidth * 4;
let n_tagbytes = self.tags.len() as u32;
// Depends on workgroup size, maybe get from stages?
let padded_n_tagbytes = (n_tagbytes + 2047) & !2047;
let pathseg_offset = pathtag_offset + padded_n_tagbytes;
// Layout of memory
let trans_alloc = 0;
let pathseg_alloc = trans_alloc + n_trans * 24;
let bbox_alloc = pathseg_alloc + self.n_pathseg * PATHSEG_SIZE;
let stage_config = stages::Config {
n_elements: self.n_path,
pathseg_alloc,
trans_alloc,
bbox_alloc,
n_trans,
pathtag_offset,
linewidth_offset,
pathseg_offset,
..Default::default()
};
stage_config
}
fn fill_scene(&self, buf: &mut BufWrite) {
let linewidth = -1.0f32;
buf.push(linewidth);
buf.extend_slice(&self.tags);
buf.fill_zero(self.tags.len().wrapping_neg() & 2047);
buf.extend_slice(&self.pathsegs);
}
fn memory_init_size(&self) -> u64 {
let mut size = 8; // offset and error
size += self.n_trans * 24;
size as u64
}
fn memory_full_size(&self) -> u64 {
let mut size = self.memory_init_size();
size += (self.n_pathseg * PATHSEG_SIZE) as u64;
size += (self.n_path * 16) as u64;
size
}
fn fill_memory(&self, buf: &mut BufWrite) {
// This stage is not dynamically allocating memory
let mem_offset = 0u32;
let mem_error = 0u32;
let mem_init = [mem_offset, mem_error];
buf.push(mem_init);
let trans = [1.0f32, 0.0, 0.0, 1.0, 0.0, 0.0];
buf.push(trans);
}
fn verify(&self, memory: &[u8]) -> Option<String> {
fn round_down(x: f32) -> u32 {
(x.floor() + 32768.0) as u32
}
fn round_up(x: f32) -> u32 {
(x.ceil() + 32768.0) as u32
}
let begin_pathseg = 32;
for i in 0..self.n_pathseg {
let offset = (begin_pathseg + PATHSEG_SIZE * i) as usize;
let actual =
bytemuck::from_bytes::<PathSeg>(&memory[offset..offset + PATHSEG_SIZE as usize]);
let expected = self.lines[i as usize];
const EPSILON: f32 = 1e-9;
if (expected.0[0] - actual.p0[0]).abs() > EPSILON
|| (expected.0[1] - actual.p0[1]).abs() > EPSILON
|| (expected.1[0] - actual.p3[0]).abs() > EPSILON
|| (expected.1[1] - actual.p3[1]).abs() > EPSILON
{
println!("{}: {:.1?} {:.1?}", i, actual, expected);
}
}
let begin_bbox = 32 + PATHSEG_SIZE * self.n_pathseg;
for i in 0..self.n_path {
let offset = (begin_bbox + 16 * i) as usize;
let actual = bytemuck::from_bytes::<Bbox>(&memory[offset..offset + 16]);
let expected_f32 = self.bbox[i as usize];
let expected = Bbox {
left: round_down(expected_f32.0),
top: round_down(expected_f32.1),
right: round_up(expected_f32.2),
bottom: round_up(expected_f32.3),
};
if expected != *actual {
println!("{}: {:?} {:?}", i, actual, expected);
return Some(format!("bbox mismatch at {}", i));
}
}
None
}
}


@ -14,7 +14,7 @@
//
// Also licensed under MIT license, at your choice.
//! Tests for piet-gpu shaders.
//! Tests for the piet-gpu transform stage.
use crate::{Config, Runner, TestResult};
@ -37,11 +37,9 @@ pub unsafe fn transform_test(runner: &mut Runner, config: &Config) -> TestResult
.session
.create_buffer_init(&data.input_data, BufferUsage::STORAGE)
.unwrap();
let memory = runner.buf_down(data_buf.size() + 24, BufferUsage::empty());
let memory = runner.buf_down(data_buf.size() + 8, BufferUsage::empty());
let stage_config = stages::Config {
n_trans: n_elements as u32,
// This is a hack to get elements aligned.
trans_alloc: 16,
..Default::default()
};
let config_buf = runner
@ -71,9 +69,8 @@ pub unsafe fn transform_test(runner: &mut Runner, config: &Config) -> TestResult
}
total_elapsed += runner.submit(commands);
if i == 0 || config.verify_all {
let mut dst: Vec<Transform> = Default::default();
memory.read(&mut dst);
if let Some(failure) = data.verify(&dst[1..]) {
let dst = memory.map_read(8..);
if let Some(failure) = data.verify(dst.cast_slice()) {
result.fail(failure);
}
}