Also add a finish_timestamps call, which is needed for DX12 (DX12 has other issues, but this one is an easy fix).
Add a tree reduction prefix sum in addition to the existing decoupled look-back one, to explore the tradeoff between performance and compatibility.
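For reference, the tree reduction idea can be sketched on the CPU roughly as below. This is only an illustration, not the shader code in this patch: the name `tree_prefix_sum` and the `PARTITION` size are made up for the example, and wrapping addition on `u32` is an assumption. Each pass depends only on the completed output of the previous one, whereas decoupled look-back does the whole scan in a single pass with partitions waiting on each other.

```rust
/// Hypothetical partition size for this sketch; a GPU version would
/// process one partition per workgroup.
const PARTITION: usize = 4;

/// Inclusive prefix sum via a multi-pass tree reduction (CPU sketch).
fn tree_prefix_sum(input: &[u32]) -> Vec<u32> {
    if input.len() <= PARTITION {
        // Base case: small enough to scan directly.
        let mut acc = 0u32;
        return input
            .iter()
            .map(|&x| {
                acc = acc.wrapping_add(x);
                acc
            })
            .collect();
    }

    // Pass 1: reduce each partition to a single total.
    let totals: Vec<u32> = input
        .chunks(PARTITION)
        .map(|c| c.iter().fold(0u32, |a, &x| a.wrapping_add(x)))
        .collect();

    // Pass 2: scan the partition totals, recursing up the tree.
    let scanned = tree_prefix_sum(&totals);

    // Pass 3: scan each partition, seeded with the sum of all earlier
    // partitions.
    let mut out = Vec::with_capacity(input.len());
    for (i, chunk) in input.chunks(PARTITION).enumerate() {
        let mut acc = if i == 0 { 0 } else { scanned[i - 1] };
        for &x in chunk {
            acc = acc.wrapping_add(x);
            out.push(acc);
        }
    }
    out
}
```

Calling `tree_prefix_sum(&[1, 2, 3, 4, 5])` should give the same result as a plain sequential inclusive scan, which is also a reasonable way to check a GPU version.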
This adds a prefix sum test. The patch also tries to get a little more serious about structuring both the test runner (toward the goal of collecting proper statistics) and the pipeline stages for the tests. Still WIP, but giving good results.