Thanks to Jeff Bolz for spotting the write-after-read hazard on the
sh_flag accesses. This fixes the failures observed on NVIDIA Turing and
Ampere GPUs under DX12.
The MSL translation of the prefix example had its bindings permuted. A
translation flag now prevents this, though, as is typical for shader
translation, it potentially creates other problems.
Also use an explicit unsigned literal to avoid DXC warnings.
This adds a prefix sum test. This patch is also trying to get a little
more serious about structuring both the test runner (toward the goal of
collecting proper statistics) and pipeline stages for the tests.
Still a work in progress, but already giving good results.
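
For reference, a minimal sketch of the kind of CPU-side check a prefix sum
test can run against the GPU output. This is not the code in this patch:
the names inclusive_prefix_sum and first_mismatch are hypothetical, and it
assumes an inclusive scan over u32 with wrapping addition (mirroring
unsigned shader arithmetic), which may differ from what the test does.

```rust
/// CPU reference for an inclusive prefix sum over u32 values.
/// Assumption: wrapping addition, to mirror unsigned arithmetic in a shader.
pub fn inclusive_prefix_sum(input: &[u32]) -> Vec<u32> {
    let mut running = 0u32;
    input
        .iter()
        .map(|&x| {
            running = running.wrapping_add(x);
            running
        })
        .collect()
}

/// Compare a (hypothetical) GPU result buffer against the CPU reference.
/// Returns the index of the first mismatching element, or None if the
/// buffers agree. A length mismatch is reported at the shorter length.
pub fn first_mismatch(input: &[u32], gpu_output: &[u32]) -> Option<usize> {
    let expected = inclusive_prefix_sum(input);
    if expected.len() != gpu_output.len() {
        return Some(expected.len().min(gpu_output.len()));
    }
    expected
        .iter()
        .zip(gpu_output)
        .position(|(e, g)| e != g)
}
```

Reporting the first failing index, rather than a bare pass/fail, makes it
easier to tell which workgroup or partition produced the wrong value.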