I managed to botch the bounds last time.
uint32 c;
/* Do sequential carry propagation (16-bit CPUs are less likely to benefit
uint32 c;
/* Do sequential carry propagation (16-bit CPUs are less likely to benefit
- * from instruction-level parallelism). Start at u_10; truncate it to 11
- * bits, and add the carry onto u_11. Truncate u_11 to 10 bits, and add
- * five times the carry onto u_0. And so on.
+ * from instruction-level parallelism). Start at u_9; truncate it to 11
+ * bits, and add the carry onto u_10. Truncate u10 to 11 bits, and add the
+ * carry onto u_11. Truncate u_11 to 10 bits, and add five times the carry
+ * onto u_0. And so on.
*
* The carry is larger than the pieces we're leaving behind. Let c_i be
* the high portion of u_i, to be carried onto u_{i+1}. I claim that c_i <
*
* The carry is larger than the pieces we're leaving behind. Let c_i be
* the high portion of u_i, to be carried onto u_{i+1}. I claim that c_i <
* carry is bounded above by 636*2^22 + 12785*2^10 < 2557*2^20. Hence, the
* carry out is at most 2557*2^10, as claimed.
*
* carry is bounded above by 636*2^22 + 12785*2^10 < 2557*2^20. Hence, the
* carry out is at most 2557*2^10, as claimed.
*
- * Once we reach u_10 for the second time, we start with u_10 < 2^11. The
- * carry into u_10 is at most 2557*2^10 < 1279*2^11 as calculated above; so
- * the carry out into u_11 is at most 1280. Since u_11 < 2^10 prior to
- * this carry in, all of the u_i are now bounded above by 2^11. The final
- * reduction therefore only needs a conditional subtraction.
+ * Once we reach u_9 for the second time, we start with u_9 < 2^11. The
+ * carry into u_9 is at most 2557*2^10 < 1279*2^11 as calculated above; so
+ * the carry out into u_10 is at most 1280. Since u_10 < 2^11 prior to
+ * this carry in, we now have u_10 < 2^11 + 1280 < 2^12; so the carry out
+ * into u_11 is at most 1. The final reduction therefore only needs a
+ * conditional subtraction.
- { c = u[10] >> 11; u[10] &= M11; }
+ { c = u[9] >> 11; u[9] &= M11; }
+ { u[10] += c; c = u[10] >> 11; u[10] &= M11; }
{ u[11] += c; c = u[11] >> 10; u[11] &= M10; }
{ u[0] += 5*c; c = u[0] >> 11; u[0] &= M11; }
for (i = 1; i < 5; i++) { u[i] += c; c = u[i] >> 11; u[i] &= M11; }
{ u[11] += c; c = u[11] >> 10; u[11] &= M10; }
{ u[0] += 5*c; c = u[0] >> 11; u[0] &= M11; }
for (i = 1; i < 5; i++) { u[i] += c; c = u[i] >> 11; u[i] &= M11; }