The Windows code doesn't need to spill r12, because we don't need the
`mi' register after we've loaded and expanded the Montgomery factor.
This doesn't save any stack space, because the stack must stay 16-byte
aligned, but it does avoid saving and restoring the register.
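
The alignment point can be illustrated with a little arithmetic.  This is
only a sketch: it assumes the spill area is made of 8-byte slots rounded
up to a multiple of 16 bytes, and the register counts (six with r12, five
without) are hypothetical, chosen to show the effect rather than taken
from the actual code.

```python
def spill_area_bytes(saved_regs, slot=8, align=16):
    """Size of a spill area holding `saved_regs` slots of `slot` bytes,
    rounded up to a multiple of `align` bytes."""
    raw = saved_regs * slot
    return (raw + align - 1) // align * align

# Hypothetical counts, for illustration only.
with_r12 = spill_area_bytes(6)     # six spilled registers, r12 included
without_r12 = spill_area_bytes(5)  # the same set with r12 dropped

print(with_r12, without_r12)
```

Under these assumptions both cases round up to the same size: dropping
one 8-byte slot just leaves 8 bytes of padding, so the only saving is
the store and reload of the register itself.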