Thanks to Jeff Bolz for spotting the write-after-read hazard on the
sh_flag accesses. This fixes observed failures on Nvidia Turing and
Ampere on DX12.
This adds both regular and Vulkan memory model atomic versions of the
prefix sum test, compiled by #ifdef. The build chain is getting messy,
but I think it's important to test this stuff.