Lokathor 2018-11-28 18:18:23 -07:00
parent 82cb153b97
commit 6ae1374412
3 changed files with 183 additions and 92 deletions


@ -142,8 +142,10 @@ Absolutely not. Do you need it for pokemon? No, not even then, but a lot of the
hot new PRNGs have come out just within the past 10 years, so we can't fault hot new PRNGs have come out just within the past 10 years, so we can't fault
them too much for it. them too much for it.
Note that generators that aren't uniform to begin with naturally don't have any TLDR: 1-dimensional equidistribution just means "a normal uniform generator",
amount of k-Dimensional Equidistribution. and higher k values mean "you can actually combine up to k output chains and
maintain uniformity". Generators that aren't uniform to begin with effectively
have a k value of 0.
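As a rough illustration of what 1-dimensional equidistribution looks like, here's a tiny 8-bit LCG small enough to check exhaustively. The constants `5` and `1` and both function names are made up for this sketch (they aren't from any generator in this chapter); they were picked only because they give a full 256-step period.

```rust
/// A toy 8-bit LCG, used only to make the idea checkable by brute force.
/// The multiplier 5 and increment 1 are illustrative assumptions.
pub fn tiny_lcg8(state: u8) -> u8 {
    state.wrapping_mul(5).wrapping_add(1)
}

/// Runs the toy generator for one full period (256 steps) and counts how
/// many times each possible output value shows up.
pub fn output_counts(seed: u8) -> [u32; 256] {
    let mut counts = [0u32; 256];
    let mut state = seed;
    for _ in 0..256 {
        state = tiny_lcg8(state);
        counts[state as usize] += 1;
    }
    counts
}
```

If every count comes back as exactly 1, the generator is uniform over its period, which is the k = 1 case described above.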
### Other Tricks ### Other Tricks
@ -171,7 +173,62 @@ all. This is the basis for how old games with limited memory like
## How To Seed ## How To Seed
TODO timers yo! Oh I bet you thought we could somehow get through a section without learning
about yet another IO register. Ha, wishful thinking.
There's actually not much involved. Starting at `0x400_0100` there's an array of
registers that go "data", "control", "data", "control", etc. TONC and GBATEK use
different names here, and we'll go by the TONC names because they're much
clearer:
```rust
pub const TM0D: VolatilePtr<u16> = VolatilePtr(0x400_0100 as *mut u16);
pub const TM0CNT: VolatilePtr<u16> = VolatilePtr(0x400_0102 as *mut u16);
pub const TM1D: VolatilePtr<u16> = VolatilePtr(0x400_0104 as *mut u16);
pub const TM1CNT: VolatilePtr<u16> = VolatilePtr(0x400_0106 as *mut u16);
pub const TM2D: VolatilePtr<u16> = VolatilePtr(0x400_0108 as *mut u16);
pub const TM2CNT: VolatilePtr<u16> = VolatilePtr(0x400_010A as *mut u16);
pub const TM3D: VolatilePtr<u16> = VolatilePtr(0x400_010C as *mut u16);
pub const TM3CNT: VolatilePtr<u16> = VolatilePtr(0x400_010E as *mut u16);
```
Basically there's 4 timers, numbered 0 to 3. Each one has a Data register and a
Control register. They're all `u16` and you can definitely _read_ from all of
them normally, but then it gets a little weird. You can also _write_ to the
Control portions normally, but when you write to the Data portion of a timer, that
writes the value that the timer resets to, _without changing_ its current Data
value. So if `TM0D` is paused on some value other than `5` and you write `5` to
it, when you read it back you won't get a `5`. When the next timer run starts
it'll begin counting at `5` instead of whatever value it currently reads as.
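The read/write split is easier to see in a toy software model. This is only a sketch: `TimerModel` and its methods are hypothetical names invented here, it never touches the real registers, and it assumes that an overflow restarts the count at the reload value.

```rust
/// A toy model of one timer's Data register behavior (not hardware access).
#[derive(Debug, Clone, Copy, Default)]
pub struct TimerModel {
    counter: u16, // what a read of the Data register returns
    reload: u16,  // what a write to the Data register stores
    enabled: bool,
}

impl TimerModel {
    /// Reading Data gives the current counter value.
    pub fn read_data(&self) -> u16 {
        self.counter
    }

    /// Writing Data only changes the reload value, never the counter.
    pub fn write_data(&mut self, value: u16) {
        self.reload = value;
    }

    /// Switching from disabled to enabled copies the reload value into
    /// the counter. Disabling leaves the counter where it was.
    pub fn set_enabled(&mut self, on: bool) {
        if on && !self.enabled {
            self.counter = self.reload;
        }
        self.enabled = on;
    }

    /// One tick. On overflow the counter restarts at the reload value
    /// (an assumption of this model).
    pub fn tick(&mut self) {
        if self.enabled {
            self.counter = match self.counter.checked_add(1) {
                Some(next) => next,
                None => self.reload,
            };
        }
    }
}
```

In this model, pausing a timer at some value, writing `5`, and reading it back still gives the old value. The `5` only shows up once the timer is enabled again.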
The Data registers are just a `u16` number, no special bits to know about.
The Control registers are also pretty simple compared to most IO registers:
* 2 bits for the **Frequency:** 1, 64, 256, 1024. While active, the timer's
value will tick up once every `frequency` CPU cycles. On the GBA, 1 CPU cycle
is about 59.59ns (2^(-24) seconds). One display controller cycle is 280,896
CPU cycles.
* 1 bit for **Cascade Mode:** If this is on the timer doesn't count on its own,
instead it ticks up whenever the _preceding_ timer overflows its counter (eg:
if t0 overflows, t1 will tick up if it's in cascade mode). You still have to
also enable this timer for it to do that (below). This naturally doesn't have
an effect when used with timer 0.
* 3 bits that do nothing
* 1 bit for **Interrupt:** Whenever this timer overflows it will signal an
interrupt.
* 1 bit to **Enable** the timer. When you disable a timer it retains the current
value, but when you enable it again the value jumps to whatever its currently
assigned default value is.
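One possible shape for packing those fields into a `u16` is sketched below. The type and method names are hypothetical, and the sketch assumes the bullets above are listed from the lowest bit upward (frequency in bits 0-1, cascade in bit 2, interrupt in bit 6, enable in bit 7).

```rust
/// How many CPU cycles pass per timer tick (hypothetical enum for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TimerFrequency {
    Cycles1 = 0,
    Cycles64 = 1,
    Cycles256 = 2,
    Cycles1024 = 3,
}

/// A plain-data view of a timer Control register (hypothetical struct).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TimerControl {
    pub frequency: TimerFrequency,
    pub cascade: bool,
    pub interrupt: bool,
    pub enabled: bool,
}

impl TimerControl {
    /// Packs the fields into the `u16` you'd write to a Control register.
    /// Bits 3-5 are left as zero since they do nothing.
    pub fn to_bits(self) -> u16 {
        (self.frequency as u16)
            | ((self.cascade as u16) << 2)
            | ((self.interrupt as u16) << 6)
            | ((self.enabled as u16) << 7)
    }
}
```

The resulting value is what you'd write to something like `TM0CNT`; the write itself is left out so that the sketch runs anywhere.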
TODO timer control struct / methods
### A Timer Based Seed
TODO turn on 2+ timers with cascading when the game turns on and wait for a key press
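Until that's written up, here's a minimal sketch of just the last step: gluing two timer readings into a 32-bit seed. It assumes timer 1 is cascading off timer 0, so timer 0 supplies the fast-moving low half and timer 1 the slow-moving high half. The function name is hypothetical, and the register reads are left out so that it runs anywhere.

```rust
/// Combines the readings of two cascaded timers into one `u32` seed.
/// `low` would be the value read from the faster timer (e.g. `TM0D`) and
/// `high` the value read from the timer cascading off it (e.g. `TM1D`).
pub fn seed_from_timers(low: u16, high: u16) -> u32 {
    ((high as u32) << 16) | (low as u32)
}
```

The idea is that the player's first key press lands at an unpredictable cycle count, so the two readings taken at that moment differ from one boot to the next.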
## Various Generators ## Various Generators
@ -182,7 +239,10 @@ cute to use. It's the PRNG that Super Mario 64 had ([video explanation,
long](https://www.youtube.com/watch?v=MiuLeTE2MeQ)). long](https://www.youtube.com/watch?v=MiuLeTE2MeQ)).
With a PRNG this simple the output of one call is _also_ the seed to the next With a PRNG this simple the output of one call is _also_ the seed to the next
call. call, so we don't need to make a struct for it or anything. You're also assumed
to just seed with a plain 0 value at startup. The generator has a painfully
small period, and you're assumed to be looping through the state space
constantly while the RNG goes.
```rust ```rust
pub fn sm64(mut input: u16) -> u16 { pub fn sm64(mut input: u16) -> u16 {
@ -209,11 +269,13 @@ pub fn sm64(mut input: u16) -> u16 {
[Compiler Explorer](https://rust.godbolt.org/z/1F6P8L) [Compiler Explorer](https://rust.godbolt.org/z/1F6P8L)
If you watch the video you'll note that the first `if` checking for `0x560A` is If you watch the video explanation about this generator you'll note that the
only potentially important to avoid being locked in a 2-step cycle, though if first `if` checking for `0x560A` prevents you from being locked into a 2-step
you can guarantee that you'll never pass a bad input value I suppose you could cycle, but it's only important if you want to feed bad seeds to the generator. A
eliminate it. The second `if` that checks for `0xAA55` doesn't seem to be bad seed is unhelpfully defined as "any value that the generator can't
important at all from a mathematical perspective. It's left in there only for output". The second `if` that checks for `0xAA55` doesn't seem to be important
at all from a mathematical perspective. It cuts the generator's period shorter
by an arbitrary amount for no known reason. It's left in there only for
authenticity. authenticity.
### LCG32 (32-bit state, 32-bit output, uniform) ### LCG32 (32-bit state, 32-bit output, uniform)
@ -274,14 +336,16 @@ pub fn lcg_streaming(seed: u32, stream: u32) -> u32 {
} }
``` ```
With a streaming LCG you should _probably_ pass the same stream value every With a streaming LCG you should pass the same stream value every single time. If
single time. If you don't, then your generator will jump between streams in some you don't, then your generator will jump between streams in some crazy way and
crazy way and you lose your nice uniformity properties. you lose your nice uniformity properties.
However, there is also the possibility of changing the stream value exactly when There is the possibility of intentionally changing the stream value exactly when
the seed lands on a pre-determined value after transformation. We need to keep the seed lands on a pre-determined value (after the multiply and add). This
odd stream values, and we would like to ensure our stream performs a full cycle _basically_ makes the stream selection value's bit size (minus one bit, because
itself, so we'll just add 2 for simplicity: it must be odd) count into the LCG's state bit size for calculating the overall
period of the generator. So an LCG32 with a 32-bit stream selection would have a
period of 2^32 * 2^31 = 2^63.
```rust ```rust
let next_seed = lcg_streaming(seed, stream); let next_seed = lcg_streaming(seed, stream);
@ -291,38 +355,22 @@ if seed == 0 {
} }
``` ```
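For a version of the stream-switching idea that can be run on its own, here's a small sketch that keeps the seed and stream together in a struct. The struct name is hypothetical, and the multiplier is an assumption (it's the one the PCG examples in this chapter use); the switch happens when the state lands on 0 after the multiply and add.

```rust
/// Assumed multiplier, borrowed from the PCG examples in this chapter.
const MULTIPLIER: u32 = 32310901;

/// An LCG that hops to the next stream whenever its state lands on 0.
pub struct StreamingLcg {
    pub state: u32,
    pub stream: u32, // must stay odd
}

impl StreamingLcg {
    /// Forces the stream value to be odd by setting its lowest bit.
    pub fn new(seed: u32, stream: u32) -> Self {
        Self { state: seed, stream: stream | 1 }
    }

    /// Advances the generator and returns the new state.
    pub fn next_u32(&mut self) -> u32 {
        self.state = self
            .state
            .wrapping_mul(MULTIPLIER)
            .wrapping_add(self.stream);
        // 0 is the cheapest value to test for, so that's the switch point.
        // Adding 2 keeps the stream value odd.
        if self.state == 0 {
            self.stream = self.stream.wrapping_add(2);
        }
        self.state
    }
}
```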
If you adjust streams at a fixed time like that then you end up going cleanly However, this isn't a particularly effective way to extend the generator's
through one stream cycle, and then the next stream cycle, and so on. This lets period, and we'll see a much better extension technique below.
you have a vastly increased generator period for minimal additional overhead.
The bit size of your generator's increment value type (minus 1, since the 1s bit
must always be odd) gets directly multiplied into your base generator's period
(2^state_size, for LCGs and PCGs). So an LCG32 with a 32-bit stream selection
would have a period of 2^32 * 2^31 = 2^63.
### PCG16 XSH-RR (32-bit state, 16-bit output, uniform) ### PCG16 XSH-RS (32-bit state, 16-bit output, uniform)
The [Permuted Congruential The [Permuted Congruential
Generator](https://en.wikipedia.org/wiki/Permuted_congruential_generator) family Generator](https://en.wikipedia.org/wiki/Permuted_congruential_generator) family
is the next step in LCG technology. We start with LCG output, which is good but is the next step in LCG technology. We start with LCG output, which is good but
not great, and then we apply one of several possible permutations to bump up the not great, and then we apply one of several possible permutations to bump up the
quality. There's basically a bunch of permutation components that are each quality. There's basically a bunch of permutation components that are each
defined in terms of the bit width that you're working with. The "default" defined in terms of the bit width that you're working with.
variant of PCG, PCG32, has 64 bits of state and 32 bits of output, and it uses
the "XSH-RR" permutation.
Obviously we'll have 32 bits of state, and so 16 bits of output. The "default" variant of PCG, PCG32, has 64 bits of state and 32 bits of output,
and it uses the "XSH-RR" permutation. Here we'll put together a 32 bit version
* **XSH:** we do an xor shift, `x ^= x >> constant`, with the constant being half with 16-bit output, and using the "XSH-RS" permutation (but we'll show the other
the bits _not_ discarded by the next operation (the RR). one too for comparison).
* **RR:** we do a randomized rotation, with output half the size of the input.
This part gets a little tricky so we'll break it down into more bullet points.
* Given a 2^b-bit input word, we have 32-bit input, `b = 5`
* the top b-1 bits are used for the rotate amount, `rotate 4`
* the next-most-significant 2^(b-1) bits are rotated right and used as the
output, `rotate the 16 bits after the top 4 bits`
* and the low 2^(b-1)+1-b bits are discarded, `discard the rest`
* This also means that the "bits not discarded" is 16+4, so the XSH constant
will be 20/2=10.
Of course, since PCG is based on a LCG, we have to start with a good LCG base. Of course, since PCG is based on a LCG, we have to start with a good LCG base.
As I said above, a better or worse set of LCG constants can make your generator As I said above, a better or worse set of LCG constants can make your generator
@ -353,6 +401,7 @@ is 8 bits or less, so we haven't used them too much ourselves yet.
I guess we'll pick 5, because I happen to personally like the number. I guess we'll pick 5, because I happen to personally like the number.
```rust ```rust
// Demo only. The "default" PCG permutation, for use when rotate is cheaper
pub fn pcg16_xsh_rr(seed: &mut u32) -> u16 { pub fn pcg16_xsh_rr(seed: &mut u32) -> u16 {
*seed = seed.wrapping_mul(32310901).wrapping_add(5); *seed = seed.wrapping_mul(32310901).wrapping_add(5);
const INPUT_SIZE: u32 = 32; const INPUT_SIZE: u32 = 32;
@ -363,26 +412,8 @@ pub fn pcg16_xsh_rr(seed: &mut u32) -> u16 {
out32 ^= out32 >> ((OUTPUT_SIZE + ROTATE_BITS) / 2); out32 ^= out32 >> ((OUTPUT_SIZE + ROTATE_BITS) / 2);
((out32 >> (OUTPUT_SIZE - ROTATE_BITS)) as u16).rotate_right(rot) ((out32 >> (OUTPUT_SIZE - ROTATE_BITS)) as u16).rotate_right(rot)
} }
```
[Compiler Explorer](https://rust.godbolt.org/z/rGTj7D) // This has slightly worse statistics but runs much better on the GBA
### PCG16 XSH-RS (32-bit state, 16-bit output, uniform)
Instead of doing a random rotate, we can also do a random shift.
* **RS:** A random (input-dependent) shift, for cases where rotates are more
expensive. Again, the output is half the size of the input.
* Beginning with a 2^b-bit input word, `b = 5`
* the top b-3 bits are used for a shift amount, `shift = 2`
* which is applied to the next-most-significant 2^(b-1)+2^(b-3)-1 bits, `the next
19 bits`
* and the least significant 2^(b-1) bits of the result are output. `output = 16`
* The low 2^(b-1)-2^(b-3)-b+4 bits are discarded. `discard the rest`
* the "bits not discarded" for the XSH step is 16+2, so the XSH constant will be
18/2=9.
```rust
pub fn pcg16_xsh_rs(seed: &mut u32) -> u16 { pub fn pcg16_xsh_rs(seed: &mut u32) -> u16 {
*seed = seed.wrapping_mul(32310901).wrapping_add(5); *seed = seed.wrapping_mul(32310901).wrapping_add(5);
const INPUT_SIZE: u32 = 32; const INPUT_SIZE: u32 = 32;
@ -396,12 +427,7 @@ pub fn pcg16_xsh_rs(seed: &mut u32) -> u16 {
} }
``` ```
[Compiler Explorer](https://rust.godbolt.org/z/EvzCAG) [Compiler Explorer](https://rust.godbolt.org/z/NtJAwS)
Turns out this is a fairly significant savings on instructions. We're theoretically
trading in a bit of statistical quality for these speed gains, but a 32-bit
generator was never going to pass muster anyway, so we might as well go with
this for our 32->16 generator.
### PCG32 RXS-M-XS (32-bit state, 32-bit output, uniform) ### PCG32 RXS-M-XS (32-bit state, 32-bit output, uniform)
@ -411,17 +437,6 @@ of dimensional equidistribution for each bit you discard as the size goes down
(so 32->16 gives 16). However, if your output size _has_ to be the same as your (so 32->16 gives 16). However, if your output size _has_ to be the same as your
input size, the PCG family is still up to the task. input size, the PCG family is still up to the task.
* **RXS:** An xorshift by a random (input-dependent) amount.
* **M:** A multiply by a fixed constant.
* **XS:** An xorshift by a fixed amount. This improves the bits in the lowest
third of bits using the upper third.
For this part, wikipedia doesn't explain as much of the backing math, and
honestly even [the paper
itself](http://www.pcg-random.org/pdf/hmc-cs-2014-0905.pdf) also doesn't quite
do a good job of it. However, rejoice, the wikipedia article lists what we
should do for 32->32, so we can just cargo cult it.
```rust ```rust
pub fn pcg32_rxs_m_xs(seed: &mut u32) -> u32 { pub fn pcg32_rxs_m_xs(seed: &mut u32) -> u32 {
*seed = seed.wrapping_mul(32310901).wrapping_add(5); *seed = seed.wrapping_mul(32310901).wrapping_add(5);
@ -430,10 +445,78 @@ pub fn pcg32_rxs_m_xs(seed: &mut u32) -> u32 {
out32 ^= out32 >> (4 + rxs); out32 ^= out32 >> (4 + rxs);
const PURE_MAGIC: u32 = 277803737; const PURE_MAGIC: u32 = 277803737;
out32 = out32.wrapping_mul(PURE_MAGIC); out32 = out32.wrapping_mul(PURE_MAGIC);
x ^ (x >> 22) out32 ^ (out32 >> 22)
} }
``` ```
[Compiler Explorer](https://rust.godbolt.org/z/j3KPId)
This permutation is the slowest but gives the strongest statistical benefits. If
you're going to be keeping 100% of the output bits you want the added strength
obviously. However, the period isn't actually any longer, so each output will be
given only once within the full period (1-dimensional equidistribution).
### PCG Extension Array
As a general improvement to any PCG you can hook on an "extension array" to give
yourself a longer period. It's all described in the [PCG
Paper](http://www.pcg-random.org/paper.html), but here's the bullet points:
* In addition to your generator's state (and possible stream) you keep an array
of "extension" values. The array _type_ is the same as your output type, and
the array _count_ must be a power of two value that's less than the maximum
value of your state size.
* When you run the generator, use the _lowest_ bits to select from your
extension array according to the array's power of two. Eg: if the size is 2
then use the single lowest bit, if it's 4 then use the lowest 2 bits, etc.
* Every time you run the generator, XOR the output with the selected value from
the array.
* Every time the generator state lands on 0, cycle every element of the array.
Here's an example using an 8 slot array and `pcg16_xsh_rs`:
```rust
// uses pcg16_xsh_rs from above
// I asked ubsan and they said this is the best way to absolutely ensure that
// our extension array is aligned so that we can pretend it's a `u32` array
// later. When it comes to memory safety, you always do what ubsan says.
#[repr(align(4))]
struct AlignedU16Array([u16; 8]);
pub struct PCG16_EXT8 {
state: u32,
ext: AlignedU16Array,
}
impl PCG16_EXT8 {
pub fn next_u16(&mut self) -> u16 {
// PCG as normal.
let mut out = pcg16_xsh_rs(&mut self.state);
// XOR with a selected extension array value
out ^= unsafe { *self.ext.0.get_unchecked((self.state & 0b111) as usize) };
// if state == 0 we cycle the array by sending each u16 pair though the
// normal LCG process.
if self.state == 0 {
unsafe {
let mut ptr = self.ext.0.as_mut_ptr() as *mut u16 as *mut u32;
for _ in 0..4 {
*ptr = (*ptr).wrapping_mul(32310901).wrapping_add(5);
ptr = ptr.offset(1);
}
}
}
out
}
}
```
[Compiler Explorer](https://rust.godbolt.org/z/HTxoHY)
The period gained from using an extension array is quite impressive. For a b-bit
generator giving r-bit outputs, and k array slots, the period goes from 2^b to
2^(k*r+b). So our 2^32 period generator has been extended to 2^160.
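The arithmetic is simple enough to write down as a couple of helper functions. The function names are hypothetical and exist only to show the math; everything is done on the exponent so nothing overflows.

```rust
/// log2 of the period of a b-bit generator with r-bit outputs and k
/// extension array slots, using the 2^(k*r+b) formula from above.
pub fn extended_period_log2(state_bits: u32, output_bits: u32, slots: u32) -> u32 {
    slots * output_bits + state_bits
}

/// Total bytes needed for the state plus the extension array.
pub fn extended_size_bytes(state_bits: u32, output_bits: u32, slots: u32) -> u32 {
    (state_bits + slots * output_bits) / 8
}
```

For `PCG16_EXT8` that's `extended_period_log2(32, 16, 8)`, which is 8*16+32 = 160, at a cost of 20 bytes of memory.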
### Xoshiro128** (128-bit state, 32-bit output, non-uniform) ### Xoshiro128** (128-bit state, 32-bit output, non-uniform)
The [Xoshiro128**](http://xoshiro.di.unimi.it/xoshiro128starstar.c) generator is The [Xoshiro128**](http://xoshiro.di.unimi.it/xoshiro128starstar.c) generator is
@ -939,7 +1022,7 @@ complicated.
Life just be that way, I guess. Life just be that way, I guess.
## Summary ## Summary Table
That was a whole lot. Let's put them in a table: That was a whole lot. Let's put them in a table:
@ -947,12 +1030,8 @@ That was a whole lot. Let's put them in a table:
|:---------------|:-----:|:------:|:------:|:-----:| |:---------------|:-----:|:------:|:------:|:-----:|
| sm64 | 2 | u16 | 65,114 | 0 | | sm64 | 2 | u16 | 65,114 | 0 |
| lcg32 | 4 | u16 | 2^32 | 1 | | lcg32 | 4 | u16 | 2^32 | 1 |
| pcg16_xsh_rr | 4 | u16 | 2^32 | 16 | | pcg16_xsh_rs | 4 | u16 | 2^32 | 1 |
| pcg16_xsh_rs | 4 | u16 | 2^32 | 16 |
| pcg32_rxs_m_xs | 4 | u32 | 2^32 | 1 | | pcg32_rxs_m_xs | 4 | u32 | 2^32 | 1 |
| PCG16_EXT8 | 20 | u16 | 2^160 | 8 |
| xoshiro128** | 16 | u32 | 2^128-1| 0 | | xoshiro128** | 16 | u32 | 2^128-1| 0 |
| jsf32 | 16 | u32 | ~2^126 | 0 | | jsf32 | 16 | u32 | ~2^126 | 0 |
TODO recap streams/jumps
TODO extension arrays?


@ -561,3 +561,15 @@ pub fn bounded_rand32(rng: &mut impl FnMut() -> u32, mut range: u32) -> u32 {
} }
x x
} }
pub const TM0D: VolatilePtr<u16> = VolatilePtr(0x400_0100 as *mut u16);
pub const TM0CNT: VolatilePtr<u16> = VolatilePtr(0x400_0102 as *mut u16);
pub const TM1D: VolatilePtr<u16> = VolatilePtr(0x400_0104 as *mut u16);
pub const TM1CNT: VolatilePtr<u16> = VolatilePtr(0x400_0106 as *mut u16);
pub const TM2D: VolatilePtr<u16> = VolatilePtr(0x400_0108 as *mut u16);
pub const TM2CNT: VolatilePtr<u16> = VolatilePtr(0x400_010A as *mut u16);
pub const TM3D: VolatilePtr<u16> = VolatilePtr(0x400_010C as *mut u16);
pub const TM3CNT: VolatilePtr<u16> = VolatilePtr(0x400_010E as *mut u16);


@ -352,28 +352,28 @@ pub const DMA3CNT_L: VolatilePtr<u16> = VolatilePtr(0x400_00DC as *mut u16);
pub const DMA3CNT_H: VolatilePtr<u16> = VolatilePtr(0x400_00DE as *mut u16); pub const DMA3CNT_H: VolatilePtr<u16> = VolatilePtr(0x400_00DE as *mut u16);
/// Timer 0 Counter/Reload /// Timer 0 Counter/Reload
pub const TM0CNT_L: VolatilePtr<u16> = VolatilePtr(0x400_0100 as *mut u16); pub const TM0D: VolatilePtr<u16> = VolatilePtr(0x400_0100 as *mut u16);
/// Timer 0 Control /// Timer 0 Control
pub const TM0CNT_H: VolatilePtr<u16> = VolatilePtr(0x400_0102 as *mut u16); pub const TM0CNT: VolatilePtr<u16> = VolatilePtr(0x400_0102 as *mut u16);
/// Timer 1 Counter/Reload /// Timer 1 Counter/Reload
pub const TM1CNT_L: VolatilePtr<u16> = VolatilePtr(0x400_0104 as *mut u16); pub const TM1D: VolatilePtr<u16> = VolatilePtr(0x400_0104 as *mut u16);
/// Timer 1 Control /// Timer 1 Control
pub const TM1CNT_H: VolatilePtr<u16> = VolatilePtr(0x400_0106 as *mut u16); pub const TM1CNT: VolatilePtr<u16> = VolatilePtr(0x400_0106 as *mut u16);
/// Timer 2 Counter/Reload /// Timer 2 Counter/Reload
pub const TM2CNT_L: VolatilePtr<u16> = VolatilePtr(0x400_0108 as *mut u16); pub const TM2D: VolatilePtr<u16> = VolatilePtr(0x400_0108 as *mut u16);
/// Timer 2 Control /// Timer 2 Control
pub const TM2CNT_H: VolatilePtr<u16> = VolatilePtr(0x400_010A as *mut u16); pub const TM2CNT: VolatilePtr<u16> = VolatilePtr(0x400_010A as *mut u16);
/// Timer 3 Counter/Reload /// Timer 3 Counter/Reload
pub const TM3CNT_L: VolatilePtr<u16> = VolatilePtr(0x400_010C as *mut u16); pub const TM3D: VolatilePtr<u16> = VolatilePtr(0x400_010C as *mut u16);
/// Timer 3 Control /// Timer 3 Control
pub const TM3CNT_H: VolatilePtr<u16> = VolatilePtr(0x400_010E as *mut u16); pub const TM3CNT: VolatilePtr<u16> = VolatilePtr(0x400_010E as *mut u16);
/// SIO Data (Normal-32bit Mode; shared with below) /// SIO Data (Normal-32bit Mode; shared with below)
pub const SIODATA32: VolatilePtr<u32> = VolatilePtr(0x400_0120 as *mut u32); pub const SIODATA32: VolatilePtr<u32> = VolatilePtr(0x400_0120 as *mut u32);