Thread: Bug in numeric multiplication

Bug in numeric multiplication

From
Dean Rasheed
Date:
Hi,

By chance, while testing the nearby numeric log/exp/pow patch, I came
across the following case which generates an initially puzzling
looking error on HEAD -- (5.6-1e-500) ^ (3.2-1e-200). This computation
actually works OK with that other patch, but only by blind luck. It
turns out that the underlying problem is a bug in the low-level
numeric multiplication function mul_var(). It is possible to trigger
it directly with the following carefully crafted inputs:

select 4790999999999999999999999999999999999999999999999999999999999999999999999999999999999999
* 9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999;

Result:

47909999999999999999999999999999999999999999999999999999999999999999999999999999997852304953000000000000000000000000000000000000000000000000000000000000000000000000000000000001

That answer is actually incorrect. Tweaking the input a little, it is
possible to generate a much more obviously nonsensical result:

select 4789999999999999999999999999999999999999999999999999999999999999999999999999999999999999
* 9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999;

Result:

4789999999999999999999999999999999999999999999999999999999999999999999999999999999785231+0,*000000000000000000000000000000000000000000000000000000000000000000000000000000000001

Notice those garbage digits in the middle of the number returned.

The problem is that these examples trigger an overflow of the digits
in the accumulator array in mul_var().

The number on the left in the first example consists of 21 copies of
9999, preceded by 4790. Those are chosen so that when added together
they lead to a value for maxdig in mul_var() of 21*9999 + 4790 =
214769, which is exactly equal to INT_MAX / (NBASE - 1). So this
doesn't quite trigger a normalisation of the accumulator array, and
leaves several of the digits in that array a little under INT_MAX at
the end of the main multiplication loop.

The problem then arises in the final carry propagation pass. During
this phase of the computation, the carry from one digit (which can be
a shade under INT_MAX / NBASE) is added to the next digit, and that's
where the overflow happens.

To fix that, the initial value for maxdig needs to be made larger to
leave headroom for the carry. The largest possible carry is INT_MAX /
NBASE, and maxdig is the maximum possible dig value divided by
NBASE-1, so maxdig needs to be initialised to

 (INT_MAX / NBASE) / (NBASE - 1)

which is 21 for NBASE = 10000.

A new corner-case input that doesn't quite trigger an accumulator
normalisation is then 47699999... The worst case inputs are now values
like this for which the sum of a sequence of input digits is INT_MAX /
(NBASE - 1) - 21 = 214769 - 21 = 214748. So in the worst case, the
accumulator's digits can be up to 214748 * 9999 = 2147265252 in the
main multiplication loop. Then, during the carry propagation phase (or
any of the normalisation phases), the carry can be anything up to
INT_MAX / NBASE = 214748. So the maximum value that can be assigned to
any individual digit is now 2147265252 + 214748 = 2147480000, which is
now less than INT_MAX.

Patch attached.

Regards,
Dean

Attachment

Re: Bug in numeric multiplication

From
Tom Lane
Date:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> The problem then arises in the final carry propagation pass. During
> this phase of the computation, the carry from one digit (which can be
> a shade under INT_MAX / NBASE) is added to the next digit, and that's
> where the overflow happens.

Nice catch!  I think the comment could use a little more work, but I'll
adjust it and push.
        regards, tom lane



Re: Bug in numeric multiplication

From
Tom Lane
Date:
I wrote:
> Dean Rasheed <dean.a.rasheed@gmail.com> writes:
>> The problem then arises in the final carry propagation pass. During
>> this phase of the computation, the carry from one digit (which can be
>> a shade under INT_MAX / NBASE) is added to the next digit, and that's
>> where the overflow happens.

> Nice catch!  I think the comment could use a little more work, but I'll
> adjust it and push.

After trying to rework the comment to explain what maxdig really meant
after your changes, I came to the conclusion that it'd be better to do
it as per attached.  Does this look sane to you?

            regards, tom lane

diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 1bfa29e..d403554 100644
*** a/src/backend/utils/adt/numeric.c
--- b/src/backend/utils/adt/numeric.c
*************** mul_var(NumericVar *var1, NumericVar *va
*** 5789,5796 ****
       * to avoid normalizing carries immediately.
       *
       * maxdig tracks the maximum possible value of any dig[] entry; when this
!      * threatens to exceed INT_MAX, we take the time to propagate carries. To
!      * avoid overflow in maxdig itself, it actually represents the max
       * possible value divided by NBASE-1.
       */
      dig = (int *) palloc0(res_ndigits * sizeof(int));
--- 5789,5801 ----
       * to avoid normalizing carries immediately.
       *
       * maxdig tracks the maximum possible value of any dig[] entry; when this
!      * threatens to exceed INT_MAX, we take the time to propagate carries.
!      * Furthermore, we need to ensure that overflow doesn't occur during the
!      * carry propagation pass below either.  The carry value could be as much
!      * as INT_MAX/NBASE, so really we should normalize when digits threaten to
!      * exceed INT_MAX - INT_MAX/NBASE.
!      *
!      * To avoid overflow in maxdig itself, it actually represents the max
       * possible value divided by NBASE-1.
       */
      dig = (int *) palloc0(res_ndigits * sizeof(int));
*************** mul_var(NumericVar *var1, NumericVar *va
*** 5806,5812 ****

          /* Time to normalize? */
          maxdig += var1digit;
!         if (maxdig > INT_MAX / (NBASE - 1))
          {
              /* Yes, do it */
              carry = 0;
--- 5811,5817 ----

          /* Time to normalize? */
          maxdig += var1digit;
!         if (maxdig > (INT_MAX - INT_MAX / NBASE) / (NBASE - 1))
          {
              /* Yes, do it */
              carry = 0;
diff --git a/src/test/regress/expected/numeric.out b/src/test/regress/expected/numeric.out
index e6ee548..c1886fd 100644
*** a/src/test/regress/expected/numeric.out
--- b/src/test/regress/expected/numeric.out
*************** SELECT * FROM num_input_test;
*** 1334,1339 ****
--- 1334,1366 ----
  (7 rows)

  --
+ -- Test some corner cases for multiplication
+ --
+ select 4790999999999999999999999999999999999999999999999999999999999999999999999999999999999999 *
9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999;
+                                                                                      ?column?
                                                            
+
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+
47909999999999999999999999999999999999999999999999999999999999999999999999999999999999985209000000000000000000000000000000000000000000000000000000000000000000000000000000000001
+ (1 row)
+
+ select 4789999999999999999999999999999999999999999999999999999999999999999999999999999999999999 *
9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999;
+                                                                                      ?column?
                                                            
+
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+
47899999999999999999999999999999999999999999999999999999999999999999999999999999999999985210000000000000000000000000000000000000000000000000000000000000000000000000000000000001
+ (1 row)
+
+ select 4770999999999999999999999999999999999999999999999999999999999999999999999999999999999999 *
9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999;
+                                                                                      ?column?
                                                            
+
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+
47709999999999999999999999999999999999999999999999999999999999999999999999999999999999985229000000000000000000000000000000000000000000000000000000000000000000000000000000000001
+ (1 row)
+
+ select 4769999999999999999999999999999999999999999999999999999999999999999999999999999999999999 *
9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999;
+                                                                                      ?column?
                                                            
+
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+
47699999999999999999999999999999999999999999999999999999999999999999999999999999999999985230000000000000000000000000000000000000000000000000000000000000000000000000000000000001
+ (1 row)
+
+ --
  -- Test some corner cases for division
  --
  select 999999999999999999999::numeric/1000000000000000000000;
diff --git a/src/test/regress/sql/numeric.sql b/src/test/regress/sql/numeric.sql
index 982287c..49ec478 100644
*** a/src/test/regress/sql/numeric.sql
--- b/src/test/regress/sql/numeric.sql
*************** INSERT INTO num_input_test(n1) VALUES ('
*** 822,827 ****
--- 822,839 ----
  SELECT * FROM num_input_test;

  --
+ -- Test some corner cases for multiplication
+ --
+
+ select 4790999999999999999999999999999999999999999999999999999999999999999999999999999999999999 *
9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999;
+
+ select 4789999999999999999999999999999999999999999999999999999999999999999999999999999999999999 *
9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999;
+
+ select 4770999999999999999999999999999999999999999999999999999999999999999999999999999999999999 *
9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999;
+
+ select 4769999999999999999999999999999999999999999999999999999999999999999999999999999999999999 *
9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999;
+
+ --
  -- Test some corner cases for division
  --


Re: Bug in numeric multiplication

From
Dean Rasheed
Date:
On 21 September 2015 at 16:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Dean Rasheed <dean.a.rasheed@gmail.com> writes:
>>> The problem then arises in the final carry propagation pass. During
>>> this phase of the computation, the carry from one digit (which can be
>>> a shade under INT_MAX / NBASE) is added to the next digit, and that's
>>> where the overflow happens.
>
>> Nice catch!  I think the comment could use a little more work, but I'll
>> adjust it and push.
>
> After trying to rework the comment to explain what maxdig really meant
> after your changes, I came to the conclusion that it'd be better to do
> it as per attached.  Does this look sane to you?
>

Yes that looks better. It's still the same amount of extra headroom
(21), but I think it's clearer your way.

Regards,
Dean



Re: Bug in numeric multiplication

From
Tom Lane
Date:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> On 21 September 2015 at 16:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> After trying to rework the comment to explain what maxdig really meant
>> after your changes, I came to the conclusion that it'd be better to do
>> it as per attached.  Does this look sane to you?

> Yes that looks better. It's still the same amount of extra headroom
> (21), but I think it's clearer your way.

OK, pushed (after further hacking on the comment ...)
        regards, tom lane



Re: Bug in numeric multiplication

From
Dean Rasheed
Date:
On 21 September 2015 at 17:14, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Dean Rasheed <dean.a.rasheed@gmail.com> writes:
>> On 21 September 2015 at 16:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> After trying to rework the comment to explain what maxdig really meant
>>> after your changes, I came to the conclusion that it'd be better to do
>>> it as per attached.  Does this look sane to you?
>
>> Yes that looks better. It's still the same amount of extra headroom
>> (21), but I think it's clearer your way.
>
> OK, pushed (after further hacking on the comment ...)
>
>                         regards, tom lane

I just noticed that div_var_fast() has almost identical code, and so
in principle it has the same vulnerability, although it obviously only
affects the transcendental functions.

I don't actually have a test case that triggers it, but it's basically
the same algorithm, so logically it needs the same additional headroom
to avoid a possible overflow.

Regards,
Dean



Re: Bug in numeric multiplication

From
Tom Lane
Date:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> I just noticed that div_var_fast() has almost identical code, and so
> in principle it has the same vulnerability, although it obviously only
> affects the transcendental functions.
> I don't actually have a test case that triggers it, but it's basically
> the same algorithm, so logically it needs the same additional headroom
> to avoid a possible overflow.

Hm, good point.  I don't feel a compulsion to have a test case that
proves it's broken before we fix it.  Do you want to send a patch?
        regards, tom lane



Re: Bug in numeric multiplication

From
Dean Rasheed
Date:
On 17 November 2015 at 14:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Dean Rasheed <dean.a.rasheed@gmail.com> writes:
>> I just noticed that div_var_fast() has almost identical code, and so
>> in principle it has the same vulnerability, although it obviously only
>> affects the transcendental functions.
>> I don't actually have a test case that triggers it, but it's basically
>> the same algorithm, so logically it needs the same additional headroom
>> to avoid a possible overflow.
>
> Hm, good point.  I don't feel a compulsion to have a test case that
> proves it's broken before we fix it.  Do you want to send a patch?
>

OK, here it is. It's slightly different from mul_var, because maxdiv
is tracking absolute values and the carry may be positive or negative,
and it's absolute value may be as high as INT_MAX / NBASE + 1 (when
it's negative), but otherwise the principle is the same.

Regards,
Dean

Attachment

Re: Bug in numeric multiplication

From
Tom Lane
Date:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> On 17 November 2015 at 14:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hm, good point.  I don't feel a compulsion to have a test case that
>> proves it's broken before we fix it.  Do you want to send a patch?

> OK, here it is. It's slightly different from mul_var, because maxdiv
> is tracking absolute values and the carry may be positive or negative,
> and it's absolute value may be as high as INT_MAX / NBASE + 1 (when
> it's negative), but otherwise the principle is the same.

I pushed this, but while looking at it, my attention was drawn to this
bit down near the end of the loop:
       /*        * The dividend digit we are about to replace might still be nonzero.        * Fold it into the next
digitposition.  We don't need to worry about        * overflow here since this should nearly cancel with the
subtraction       * of the divisor.        */       div[qi + 1] += div[qi] * NBASE;
 

In the first place, I'm not feeling especially convinced by that handwave
about there being no risk of overflow.  In the second place, this is
indisputably failing to consider whether maxdiv might need to increase.
If I add
       Assert(Abs(div[qi + 1]) <= (maxdiv * (NBASE-1)));

right after this, the regression tests blow up.  So it looks to me like
we've got some more work to do.
        regards, tom lane



Re: Bug in numeric multiplication

From
Tom Lane
Date:
I wrote:
> I pushed this, but while looking at it, my attention was drawn to this
> bit down near the end of the loop:

>         /*
>          * The dividend digit we are about to replace might still be nonzero.
>          * Fold it into the next digit position.  We don't need to worry about
>          * overflow here since this should nearly cancel with the subtraction
>          * of the divisor.
>          */
>         div[qi + 1] += div[qi] * NBASE;

> In the first place, I'm not feeling especially convinced by that handwave
> about there being no risk of overflow.  In the second place, this is
> indisputably failing to consider whether maxdiv might need to increase.
> If I add
>         Assert(Abs(div[qi + 1]) <= (maxdiv * (NBASE-1)));
> right after this, the regression tests blow up.  So it looks to me like
> we've got some more work to do.

After thinking some more about what this is doing, it seems to me that
we could avoid changing div[qi + 1] at all here, and instead deal with
leftover dividend digits by shoving them into the floating-point part of
the calculation.  All that we really need to have happen as a consequence
of the above is to inject an additional "div[qi] * NBASE" term into future
calculations of fdividend.  So we can do that out-of-band and avoid
mucking up the maxdiv invariant.

Also, I realized that this bit:

                /*
                 * All the div[] digits except possibly div[qi] are now in the
                 * range 0..NBASE-1.
                 */
                maxdiv = Abs(newdig) / (NBASE - 1);
                maxdiv = Max(maxdiv, 1);

is unduly conservative, because, while indeed div[qi] might be out of
range, there is no need to consider it in maxdiv anymore, because the new
maxdiv value only determines what will happen in future loop iterations
where the current div[qi] will be irrelevant.  So we should just reset
maxdiv to 1 unconditionally here.

So that led me to the attached patch, which passes regression tests fine.
I'm not sure that it's provably okay though.  The loose ends are to show
that div[qi] can't overflow an int during the divisor-subtraction step
and that "outercarry" remains bounded.   Testing suggests that outercarry
can't exceed INT_MAX/NBASE, but I don't see how to prove that.

            regards, tom lane

diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index dcdc5cf..03a646b 100644
*** a/src/backend/utils/adt/numeric.c
--- b/src/backend/utils/adt/numeric.c
*************** div_var_fast(NumericVar *var1, NumericVa
*** 6186,6192 ****
      double        fdividend,
                  fdivisor,
                  fdivisorinverse,
!                 fquotient;
      int            qi;
      int            i;

--- 6186,6193 ----
      double        fdividend,
                  fdivisor,
                  fdivisorinverse,
!                 fquotient,
!                 outercarry;
      int            qi;
      int            i;

*************** div_var_fast(NumericVar *var1, NumericVa
*** 6274,6289 ****
       * To avoid overflow in maxdiv itself, it represents the max absolute
       * value divided by NBASE-1, ie, at the top of the loop it is known that
       * no div[] entry has an absolute value exceeding maxdiv * (NBASE-1).
       */
      maxdiv = 1;

      /*
!      * Outer loop computes next quotient digit, which will go into div[qi]
       */
      for (qi = 0; qi < div_ndigits; qi++)
      {
          /* Approximate the current dividend value */
!         fdividend = (double) div[qi];
          for (i = 1; i < 4; i++)
          {
              fdividend *= NBASE;
--- 6275,6297 ----
       * To avoid overflow in maxdiv itself, it represents the max absolute
       * value divided by NBASE-1, ie, at the top of the loop it is known that
       * no div[] entry has an absolute value exceeding maxdiv * (NBASE-1).
+      *
+      * Note that maxdiv only includes digit positions that are still part of
+      * the dividend, ie, strictly speaking the above holds only for div[i]
+      * where i >= qi, at the top of the loop.
       */
      maxdiv = 1;

      /*
!      * Outer loop computes next quotient digit, which will go into div[qi].
!      * "outercarry" represents high-order dividend digits carried across
!      * iterations.
       */
+     outercarry = 0;
      for (qi = 0; qi < div_ndigits; qi++)
      {
          /* Approximate the current dividend value */
!         fdividend = outercarry * NBASE + (double) div[qi];
          for (i = 1; i < 4; i++)
          {
              fdividend *= NBASE;
*************** div_var_fast(NumericVar *var1, NumericVa
*** 6320,6340 ****
                          carry = 0;
                      div[i] = newdig;
                  }
!                 newdig = div[qi] + carry;
!                 div[qi] = newdig;

                  /*
                   * All the div[] digits except possibly div[qi] are now in the
!                  * range 0..NBASE-1.
                   */
!                 maxdiv = Abs(newdig) / (NBASE - 1);
!                 maxdiv = Max(maxdiv, 1);

                  /*
                   * Recompute the quotient digit since new info may have
                   * propagated into the top four dividend digits
                   */
!                 fdividend = (double) div[qi];
                  for (i = 1; i < 4; i++)
                  {
                      fdividend *= NBASE;
--- 6328,6347 ----
                          carry = 0;
                      div[i] = newdig;
                  }
!                 div[qi] += carry;

                  /*
                   * All the div[] digits except possibly div[qi] are now in the
!                  * range 0..NBASE-1; and we don't care about including div[qi]
!                  * anymore, so we can reset maxdiv to 1.
                   */
!                 maxdiv = 1;

                  /*
                   * Recompute the quotient digit since new info may have
                   * propagated into the top four dividend digits
                   */
!                 fdividend = outercarry * NBASE + (double) div[qi];
                  for (i = 1; i < 4; i++)
                  {
                      fdividend *= NBASE;
*************** div_var_fast(NumericVar *var1, NumericVa
*** 6360,6370 ****

          /*
           * The dividend digit we are about to replace might still be nonzero.
!          * Fold it into the next digit position.  We don't need to worry about
!          * overflow here since this should nearly cancel with the subtraction
!          * of the divisor.
           */
!         div[qi + 1] += div[qi] * NBASE;

          div[qi] = qdigit;
      }
--- 6367,6382 ----

          /*
           * The dividend digit we are about to replace might still be nonzero.
!          * Carry it forward to future division steps in "outercarry".
!          *
!          * Although it might appear that outercarry would soon grow large
!          * enough to lose precision, this is not actually possible because
!          * each divisor-subtraction step will drive the accumulated value
!          * towards zero.  We could probably store it in an int without
!          * overflow problems, but since the value is only used in floating
!          * calculations, we might as well store it as a double.
           */
!         outercarry = outercarry * NBASE + (double) div[qi];

          div[qi] = qdigit;
      }
*************** div_var_fast(NumericVar *var1, NumericVa
*** 6372,6378 ****
      /*
       * Approximate and store the last quotient digit (div[div_ndigits])
       */
!     fdividend = (double) div[qi];
      for (i = 1; i < 4; i++)
          fdividend *= NBASE;
      fquotient = fdividend * fdivisorinverse;
--- 6384,6390 ----
      /*
       * Approximate and store the last quotient digit (div[div_ndigits])
       */
!     fdividend = outercarry * NBASE + (double) div[qi];
      for (i = 1; i < 4; i++)
          fdividend *= NBASE;
      fquotient = fdividend * fdivisorinverse;

Re: Bug in numeric multiplication

From
Dean Rasheed
Date:
On 17 November 2015 at 23:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> After thinking some more about what this is doing, it seems to me that
> we could avoid changing div[qi + 1] at all here, and instead deal with
> leftover dividend digits by shoving them into the floating-point part of
> the calculation.  All that we really need to have happen as a consequence
> of the above is to inject an additional "div[qi] * NBASE" term into future
> calculations of fdividend.  So we can do that out-of-band and avoid
> mucking up the maxdiv invariant.
>

That makes sense.

> Also, I realized that this bit:
>
>                 /*
>                  * All the div[] digits except possibly div[qi] are now in the
>                  * range 0..NBASE-1.
>                  */
>                 maxdiv = Abs(newdig) / (NBASE - 1);
>                 maxdiv = Max(maxdiv, 1);
>
> is unduly conservative, because, while indeed div[qi] might be out of
> range, there is no need to consider it in maxdiv anymore, because the new
> maxdiv value only determines what will happen in future loop iterations
> where the current div[qi] will be irrelevant.  So we should just reset
> maxdiv to 1 unconditionally here.
>

OK, makes sense too.

> So that led me to the attached patch, which passes regression tests fine.

I'd feel better about that if the regression tests actually did
something that stressed this, but coming up with such a test is
non-trivial.

> I'm not sure that it's provably okay though.  The loose ends are to show
> that div[qi] can't overflow an int during the divisor-subtraction step
> and that "outercarry" remains bounded.   Testing suggests that outercarry
> can't exceed INT_MAX/NBASE, but I don't see how to prove that.
>

At the top of the loop we know that
 Abs(div[qi]) <= maxdiv * (NBASE-1)              <= INT_MAX - INT_MAX/NBASE - 1

Then we add Abs(qdigit) to maxdiv which potentially triggers a
normalisation step. That step adds carry to div[qi], and we know that
 Abs(carry) <= INT_MAX/NBASE + 1

so the result is that Abs(div[qi]) <= INT_MAX -- i.e., it can't
overflow at that point, but it could be on the verge of doing so.

Then the divisor-subtraction step does
 div[qi] -= qdigit * var2digits[0]

That looks as though it could overflow, although actually I would
expect qdigit to have the same sign as div[qi], so that this would
drive div[qi] towards zero. If you didn't want to rely on that though,
you could take advantage of the fact that this new value of div[qi] is
only needed for outercarry, so you could modify the
divisor-subtraction step to skip div[qi]:
 for (i = 1; i < istop; i++)   div[qi + i] -= qdigit * var2digits[i];

and fold the most significant digit of the divisor-subtraction step
into the computation of outercarry so that there is definitely no
possibility of integer overflow:
 outercarry = outercarry * NBASE + (double) div[qi]            - (double) qdigit * var2digits[0];

It's still difficult to prove that outercarry remains bounded, but to
a first approximation
 qdigit ~= ( outercarry * NBASE + (double) div[qi] ) / var2digits[0]

so outercarry ought to be driven towards zero, as the comment says. It
certainly doesn't look like it will grow without bounds, but I haven't
attempted to work out what its bounds actually are.

Regards,
Dean



Re: Bug in numeric multiplication

From
Tom Lane
Date:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> On 17 November 2015 at 23:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm not sure that it's provably okay though.  The loose ends are to show
>> that div[qi] can't overflow an int during the divisor-subtraction step
>> and that "outercarry" remains bounded.   Testing suggests that outercarry
>> can't exceed INT_MAX/NBASE, but I don't see how to prove that.

> At the top of the loop we know that
>   Abs(div[qi]) <= maxdiv * (NBASE-1)
>                <= INT_MAX - INT_MAX/NBASE - 1
> Then we add Abs(qdigit) to maxdiv which potentially triggers a
> normalisation step. That step adds carry to div[qi], and we know that
>   Abs(carry) <= INT_MAX/NBASE + 1
> so the result is that Abs(div[qi]) <= INT_MAX -- i.e., it can't
> overflow at that point, but it could be on the verge of doing so.

Right.

> Then the divisor-subtraction step does
>   div[qi] -= qdigit * var2digits[0]
> That looks as though it could overflow, although actually I would
> expect qdigit to have the same sign as div[qi], so that this would
> drive div[qi] towards zero.

But if outercarry is nonzero, then qdigit is going to be primarily driven
by outercarry not div[qi], so I'm afraid it's mistaken to rely on it
having the right sign to cause cancellation.

> If you didn't want to rely on that though,
> you could take advantage of the fact that this new value of div[qi] is
> only needed for outercarry, so you could modify the
> divisor-subtraction step to skip div[qi]:
> and fold the most significant digit of the divisor-subtraction step
> into the computation of outercarry

Cute idea.  Another thought I'd had is that we could change the carry
propagation loop limits so that div[qi] is normalized along with the other
digits.  Then we'd have a carry leftover at the end of the loop, but we
could add it to outercarry.  That makes the argument that div[qi] does
not overflow the same as for the other digits.

However, I kind of like your idea of postponing the div[qi] subtraction
to the outercarry update, because then it's very direct to see that
that update should result in near-total cancellation.

> so outercarry ought to be driven towards zero, as the comment says. It
> certainly doesn't look like it will grow without bounds, but I haven't
> attempted to work out what its bounds actually are.

I'm kind of stuck on that too.  I did some experimentation by tracking
maximum values of outercarry in the regression tests (including
numeric_big) and did not see it get larger than a couple hundred thousand,
ie more or less INT_MAX/NBASE.  But I don't have an argument as to why
that would be the limit.

Another issue here is that with outercarry added into the qdigit
computation, it's no longer clear what the bounds of qdigit itself are,
so is it really safe to assume that "div[qi + i] -= qdigit * var2digits[i]"
can't overflow?  Stated in other terms, why are we sure that the new
maxdiv value computed at the bottom of the carry propagation stanza is
within the safe range?
        regards, tom lane



Re: Bug in numeric multiplication

From
Tom Lane
Date:
I wrote:
> I'm kind of stuck on that too.  I did some experimentation by tracking
> maximum values of outercarry in the regression tests (including
> numeric_big) and did not see it get larger than a couple hundred thousand,
> ie more or less INT_MAX/NBASE.  But I don't have an argument as to why
> that would be the limit.

A bit of progress on this: I've found a test case, namely

select
sqrt(99999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999.0099999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999);

If one inserts no-overflow Asserts into div_var_fast, this case will trip
them, specifically this:

        /*
         * The dividend digit we are about to replace might still be nonzero.
         * Fold it into the next digit position.  We don't need to worry about
         * overflow here since this should nearly cancel with the subtraction
         * of the divisor.
         */
+       Assert(Abs(div[qi]) <= INT_MAX/NBASE);
        div[qi + 1] += div[qi] * NBASE;

Unfortunately, there isn't any SQL-visible misbehavior in this example,
because the loop in sqrt_var is more or less self-correcting for minor
errors (and this example produces bogus results only in very low-order
digits).  Most of the other calls of div_var_fast give it inputs that are
even harder to control than sqrt_var's, so finding something that did
produce visibly wrong results might take a whole lot of trial and error.

Still, this proves that we are onto a real problem.

> Another issue here is that with outercarry added into the qdigit
> computation, it's no longer clear what the bounds of qdigit itself are,

I concluded that that particular issue is a red herring: qdigit should
always be a fairly accurate estimate of the next output digit, so it
cannot fall very far outside the range 0..NBASE-1.  Testing confirms that
the range seen in the regression tests is exactly -1 to 10000, which is
what you'd expect since there can be some roundoff error.

Also, after further thought I've been able to construct an argument why
outercarry stays bounded.  See what you think of the comments in the
attached patch, which incorporates your ideas about postponing the div[qi]
calculation.

            regards, tom lane

diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index dcdc5cf..2c6a3f9 100644
*** a/src/backend/utils/adt/numeric.c
--- b/src/backend/utils/adt/numeric.c
*************** div_var_fast(NumericVar *var1, NumericVa
*** 6186,6192 ****
      double        fdividend,
                  fdivisor,
                  fdivisorinverse,
!                 fquotient;
      int            qi;
      int            i;

--- 6186,6193 ----
      double        fdividend,
                  fdivisor,
                  fdivisorinverse,
!                 fquotient,
!                 outercarry;
      int            qi;
      int            i;

*************** div_var_fast(NumericVar *var1, NumericVa
*** 6274,6289 ****
       * To avoid overflow in maxdiv itself, it represents the max absolute
       * value divided by NBASE-1, ie, at the top of the loop it is known that
       * no div[] entry has an absolute value exceeding maxdiv * (NBASE-1).
       */
      maxdiv = 1;

      /*
!      * Outer loop computes next quotient digit, which will go into div[qi]
       */
      for (qi = 0; qi < div_ndigits; qi++)
      {
          /* Approximate the current dividend value */
!         fdividend = (double) div[qi];
          for (i = 1; i < 4; i++)
          {
              fdividend *= NBASE;
--- 6275,6297 ----
       * To avoid overflow in maxdiv itself, it represents the max absolute
       * value divided by NBASE-1, ie, at the top of the loop it is known that
       * no div[] entry has an absolute value exceeding maxdiv * (NBASE-1).
+      *
+      * Note that maxdiv only includes digit positions that are still part of
+      * the dividend, ie, strictly speaking the above holds only for div[i]
+      * where i >= qi, at the top of the loop.
       */
      maxdiv = 1;

      /*
!      * Outer loop computes next quotient digit, which will go into div[qi].
!      * "outercarry" represents high-order dividend digits carried across
!      * iterations.
       */
+     outercarry = 0;
      for (qi = 0; qi < div_ndigits; qi++)
      {
          /* Approximate the current dividend value */
!         fdividend = outercarry * NBASE + (double) div[qi];
          for (i = 1; i < 4; i++)
          {
              fdividend *= NBASE;
*************** div_var_fast(NumericVar *var1, NumericVa
*** 6320,6340 ****
                          carry = 0;
                      div[i] = newdig;
                  }
!                 newdig = div[qi] + carry;
!                 div[qi] = newdig;

                  /*
                   * All the div[] digits except possibly div[qi] are now in the
!                  * range 0..NBASE-1.
                   */
!                 maxdiv = Abs(newdig) / (NBASE - 1);
!                 maxdiv = Max(maxdiv, 1);

                  /*
                   * Recompute the quotient digit since new info may have
                   * propagated into the top four dividend digits
                   */
!                 fdividend = (double) div[qi];
                  for (i = 1; i < 4; i++)
                  {
                      fdividend *= NBASE;
--- 6328,6347 ----
                          carry = 0;
                      div[i] = newdig;
                  }
!                 div[qi] += carry;

                  /*
                   * All the div[] digits except possibly div[qi] are now in the
!                  * range 0..NBASE-1; and we don't care about including div[qi]
!                  * anymore, so we can reset maxdiv to 1.
                   */
!                 maxdiv = 1;

                  /*
                   * Recompute the quotient digit since new info may have
                   * propagated into the top four dividend digits
                   */
!                 fdividend = outercarry * NBASE + (double) div[qi];
                  for (i = 1; i < 4; i++)
                  {
                      fdividend *= NBASE;
*************** div_var_fast(NumericVar *var1, NumericVa
*** 6348,6370 ****
                  maxdiv += Abs(qdigit);
              }

!             /* Subtract off the appropriate multiple of the divisor */
              if (qdigit != 0)
              {
                  int            istop = Min(var2ndigits, div_ndigits - qi + 1);

!                 for (i = 0; i < istop; i++)
                      div[qi + i] -= qdigit * var2digits[i];
              }
          }

          /*
           * The dividend digit we are about to replace might still be nonzero.
!          * Fold it into the next digit position.  We don't need to worry about
!          * overflow here since this should nearly cancel with the subtraction
!          * of the divisor.
           */
!         div[qi + 1] += div[qi] * NBASE;

          div[qi] = qdigit;
      }
--- 6355,6412 ----
                  maxdiv += Abs(qdigit);
              }

!             /*
!              * Subtract off the appropriate multiple of the divisor.
!              *
!              * In principle we should update div[qi] along with the other
!              * digits, but since we're about to replace div[qi] with qdigit,
!              * we prefer to fold that subtraction step into the outercarry
!              * update below.  That way we can do it in float arithmetic and
!              * avoid any worry about overflow due to div[qi] possibly being
!              * larger than the other digits after a carry propagation.
!              */
              if (qdigit != 0)
              {
                  int            istop = Min(var2ndigits, div_ndigits - qi + 1);

!                 for (i = 1; i < istop; i++)
                      div[qi + i] -= qdigit * var2digits[i];
              }
          }

          /*
           * The dividend digit we are about to replace might still be nonzero.
!          * Carry it forward to future division steps in "outercarry", being
!          * sure to apply the last divisor-subtraction step too.
!          *
!          * Although it might appear that outercarry could grow large enough to
!          * lose precision, this is not actually possible because the
!          * higher-order terms in qdigit are
!          *
!          * qdigit ~= (outercarry * NBASE + div[qi]) / var2digits[0]
!          *
!          * So the outercarry calculation would yield zero were it not for the
!          * lower-order terms and the fact that qdigit was truncated to an int.
!          * Truncation could let the result range from 0 to var2digits[0] - 1.
!          * As for the lower-order terms: including additional digits from the
!          * divisor cannot increase qdigit compared to the above approximation,
!          * and can decrease it by at most a factor of two (this occurs when
!          * var2digits[0] = 1 and var2digits[1] is NBASE-1).  So this
!          * adjustment may mean that outercarry is not decreased all the way to
!          * zero, but it won't be increased either.  The effect of including
!          * additional dividend digits is large only if outercarry is zero, and
!          * becomes negligible as outercarry increases.  If outercarry is more
!          * than INT_MAX/NBASE then it's impossible for div[qi + 1] to create
!          * more than 1/NBASE relative change in qdigit, ie qdigit is at most
!          * off by 1, so that the error is comparable to truncation error.
!          * Testing suggests that the maximum observed outercarry is somewhat
!          * less than INT_MAX/NBASE.  (However, since outercarry will only be
!          * used in floating-point calculations, we might as well compute it as
!          * a double and avoid any worry about intermediate overflows in this
!          * expression.)
           */
!         outercarry = outercarry * NBASE + (double) div[qi]
!             - (double) (qdigit * var2digits[0]);

          div[qi] = qdigit;
      }
*************** div_var_fast(NumericVar *var1, NumericVa
*** 6372,6378 ****
      /*
       * Approximate and store the last quotient digit (div[div_ndigits])
       */
!     fdividend = (double) div[qi];
      for (i = 1; i < 4; i++)
          fdividend *= NBASE;
      fquotient = fdividend * fdivisorinverse;
--- 6414,6420 ----
      /*
       * Approximate and store the last quotient digit (div[div_ndigits])
       */
!     fdividend = outercarry * NBASE + (double) div[qi];
      for (i = 1; i < 4; i++)
          fdividend *= NBASE;
      fquotient = fdividend * fdivisorinverse;

Re: Bug in numeric multiplication

From
Dean Rasheed
Date:
On 18 November 2015 at 22:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> A bit of progress on this: I've found a test case, namely
>
> select sqrt(999999...
>

Wow.

> Still, this proves that we are onto a real problem.
>

Agreed. I had a go at turning that example into something using
log(base, num) so that the result would be visible in a pure SQL test,
but I didn't have any luck.

>> Another issue here is that with outercarry added into the qdigit
>> computation, it's no longer clear what the bounds of qdigit itself are,
>
> I concluded that that particular issue is a red herring: qdigit should
> always be a fairly accurate estimate of the next output digit, so it
> cannot fall very far outside the range 0..NBASE-1.  Testing confirms that
> the range seen in the regression tests is exactly -1 to 10000, which is
> what you'd expect since there can be some roundoff error.
>

Right, I came to the same conclusion.

> Also, after further thought I've been able to construct an argument why
> outercarry stays bounded.  See what you think of the comments in the
> attached patch, which incorporates your ideas about postponing the div[qi]
> calculation.

I think the logic is sound, but I worry that that comment is going to
be very difficult to follow.

I had a go at trying to find a simpler approach and came up with the
attached, which computes the value intended for div[qi+1] near the top
of the loop (using doubles) so that it can detect if it will overflow,
and if so it enters the normalisation block. It still needs some work
to prove that the normalised value for fnextdigit won't overflow, but
it feels like a promising, easier-to-follow approach to me. What do
you think?

Regards,
Dean

Attachment

Re: Bug in numeric multiplication

From
Tom Lane
Date:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> On 18 November 2015 at 22:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Still, this proves that we are onto a real problem.

> Agreed. I had a go at turning that example into something using
> log(base, num) so that the result would be visible in a pure SQL test,
> but I didn't have any luck.

I experimented with that idea some, and found a test case that would
trigger the Assert during log_var's final div_var_fast call:

select log(

exp(.999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999995050012831598249352382434889825483159817)
,

exp(.99999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999009999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999));

However, the emitted answer was the same with or without my proposed
patch.  So I dug deeper by putting in some debug printf's, and found that
the overflow occurs at this point:

div[81] will overflow: div[qi] = 218943 qdigit = 7673 maxdiv = 210375
div[]:0000 0001 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
00000000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 -001 9999 9999 9999
50499871 6840 1750 6476 1756 5110 1745 1684 0183 0860 9567 3786 7660 2671 2843 5974 6515 4013 7886 0978 7230 5372 8162
70772013 6185 3746 3095 8528 1595 3071 7541 7920 7950 218943 -2103498897 -2103499629 -2103499629 -2103499629
-2103504578

Note that div[qi+1], and indeed all the remaining dividend digits, are
large negative values.  This is what you'd expect if the carry propagation
step hasn't run for awhile, which is a precondition for div[qi] being
large enough to cause an issue.  When we compute 218943 * 10000, we will
indeed get an overflow, and the result will wrap around to some large
negative value (2^32 less than it should be).  Then we will add that to
div[qi+1], and we'll get *another* overflow, wrapping what nominally
should have been a negative sum around to a positive value (2^32 more than
it should be).  So the two overflows cancel and we get exactly the correct
new value of div[qi+1].

I do not know whether it's possible to devise a test case where you don't
get offsetting overflows.  It may be that there's no operational bug here.
Still, the code is surely not behaving per face value.

> I had a go at trying to find a simpler approach and came up with the
> attached, which computes the value intended for div[qi+1] near the top
> of the loop (using doubles) so that it can detect if it will overflow,
> and if so it enters the normalisation block. It still needs some work
> to prove that the normalised value for fnextdigit won't overflow, but
> it feels like a promising, easier-to-follow approach to me. What do
> you think?

I'll look at this later ...
        regards, tom lane



Re: Bug in numeric multiplication

From
Tom Lane
Date:
I wrote:
> Note that div[qi+1], and indeed all the remaining dividend digits, are
> large negative values.  This is what you'd expect if the carry propagation
> step hasn't run for awhile, which is a precondition for div[qi] being
> large enough to cause an issue.  When we compute 218943 * 10000, we will
> indeed get an overflow, and the result will wrap around to some large
> negative value (2^32 less than it should be).  Then we will add that to
> div[qi+1], and we'll get *another* overflow, wrapping what nominally
> should have been a negative sum around to a positive value (2^32 more than
> it should be).  So the two overflows cancel and we get exactly the correct
> new value of div[qi+1].

> I do not know whether it's possible to devise a test case where you don't
> get offsetting overflows.  It may be that there's no operational bug here.
> Still, the code is surely not behaving per face value.

After further thought I've pretty well convinced myself that there is
indeed no observable bug, at least as long as you assume that overflow
within the multiplication will behave as stated above.  The proof goes
like this:

We already know that the divisor-subtraction step cannot cause an overflow
(modulo the question of div[qi] possibly exceeding the maxdiv limit,
which I'll get back to).  So what is at stake is just the possibility of
overflow in the final update that transfers leftover digits from div[qi]
to div[qi+1].

To analyze that, consider just the first two dividend and divisor terms in
the qdigit calculation, that is during each loop iteration approximate
qdigit asqdigit ~= trunc((div[qi]*NBASE + div[qi+1]) /                (var2digits[0]*NBASE + var2digits[1]))
This holds whether or not we do a carry propagation step.  Now write
what the divisor-subtraction step updates div[qi] to asdiv[qi]' = div[qi] - qdigit * var2digits[0]
and the updated div[qi+1] asdiv[qi+1]' = div[qi+1] - qdigit * var2digits[1]
So, if we temporarily disregard the possibility of overflow within the
final calculationdiv[qi+1]'' = div[qi+1]' + div[qi]'*NBASE
we can see that div[qi+1]'' is going to be= div[qi+1]' + div[qi]'*NBASE= div[qi+1] - qdigit * var2digits[1] +  (div[qi]
-qdigit * var2digits[0])*NBASE= div[qi]*NBASE + div[qi+1] -  qdigit * (var2digits[0]*NBASE + var2digits[1])
 
Comparing that to the approximate value of qdigit, we can see that
what we've got here is a modulo calculation, and the value of div[qi+1]''
is going to cancel to zero except for a remainder modulo
var2digits[0]*NBASE + var2digits[1], which of course must fall between 0
and NBASE*NBASE+NBASE (since the truncation is towards minus infinity).

Now of course this is only an approximation.  The main source of error is
that we've omitted the lower-order dividend terms.  Including div[qi+2]
could change the result by at most about INT_MAX/NBASE (which is the most
it could add to the numerator in the qdigit expression above, which will
propagate more or less linearly to div[qi+1]'').  Similarly, adding
div[qi+3] could add at most INT_MAX/NBASE/NBASE.  So we aren't anywhere
near the overflow threshold.  Including the lower-order divisor terms
could change the value of qdigit by a multiplicative factor of no more
than about 1/NBASE.  Lastly, roundoff error in the floating-point part
of the calculation could cause qdigit to be off by one count either
way.  None of these effects are going to let the final div[qi+1] value
get to more than two or three times NBASE squared, which is still
an order of magnitude less than INT_MAX.

Therefore, the final div[qi+1] value cannot overflow an int, and even
though it might be larger than what maxdiv would suggest, it's not going
to be large enough to cause overflow in the next loop iteration either.
It obviously won't cause overflow during carry propagation, and as for
the next divisor-subtraction step, we can do an analysis similar to the
above but approximating qdigit with just these terms:qdigit ~= trunc((div[qi] + div[qi+1]/NBASE) / var2digits[0])
Plugging that intodiv[qi]' = div[qi] - qdigit * var2digits[0]
shows that the updated div[qi] in any loop iteration is going to be just
about -div[qi+1]/NBASE, plus a truncation term that's between 0 and
var2digits[0], plus lower-order terms that aren't going to get you very
much past INT_MAX/NBASE.  So div[qi]' is never an overflow hazard in any
loop iteration.

In short, it's impossible for the end result of the div[qi+1] update
calculation to overflow, or even to get large enough to create a problem
in the next loop iteration.  However, the intermediate result
div[qi]*NBASE can overflow as per the example I showed before.

We could just ignore this on the grounds that it doesn't matter given sane
behavior of integer arithmetic.  Or, if we wanted to be more paranoid, we
could do the multiplication-and-addition in int64 arithmetic; but that
would probably slow things down a bit.

I think all this deserves some code comments, but I'm not sure whether
we need to bother with casting to int64.  Thoughts?
        regards, tom lane



Re: Bug in numeric multiplication

From
Dean Rasheed
Date:
On 24 November 2015 at 16:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> After further thought I've pretty well convinced myself that there is
> indeed no observable bug, at least as long as you assume that overflow
> within the multiplication will behave as stated above.  The proof goes
> like this:
>
> We already know that the divisor-subtraction step cannot cause an overflow
> (modulo the question of div[qi] possibly exceeding the maxdiv limit,
> which I'll get back to).  So what is at stake is just the possibility of
> overflow in the final update that transfers leftover digits from div[qi]
> to div[qi+1].
>
> To analyze that, consider just the first two dividend and divisor terms in
> the qdigit calculation, that is during each loop iteration approximate
> qdigit as
>         qdigit ~= trunc((div[qi]*NBASE + div[qi+1]) /
>                         (var2digits[0]*NBASE + var2digits[1]))
> This holds whether or not we do a carry propagation step.  Now write
> what the divisor-subtraction step updates div[qi] to as
>         div[qi]' = div[qi] - qdigit * var2digits[0]
> and the updated div[qi+1] as
>         div[qi+1]' = div[qi+1] - qdigit * var2digits[1]
> So, if we temporarily disregard the possibility of overflow within the
> final calculation
>         div[qi+1]'' = div[qi+1]' + div[qi]'*NBASE
> we can see that div[qi+1]'' is going to be
>         = div[qi+1]' + div[qi]'*NBASE
>         = div[qi+1] - qdigit * var2digits[1] +
>           (div[qi] - qdigit * var2digits[0])*NBASE
>         = div[qi]*NBASE + div[qi+1] -
>           qdigit * (var2digits[0]*NBASE + var2digits[1])
> Comparing that to the approximate value of qdigit, we can see that
> what we've got here is a modulo calculation, and the value of div[qi+1]''
> is going to cancel to zero except for a remainder modulo
> var2digits[0]*NBASE + var2digits[1], which of course must fall between 0
> and NBASE*NBASE+NBASE (since the truncation is towards minus infinity).
>

Yes that makes sense.


> Now of course this is only an approximation.  The main source of error is
> that we've omitted the lower-order dividend terms.  Including div[qi+2]
> could change the result by at most about INT_MAX/NBASE (which is the most
> it could add to the numerator in the qdigit expression above, which will
> propagate more or less linearly to div[qi+1]'').  Similarly, adding
> div[qi+3] could add at most INT_MAX/NBASE/NBASE.  So we aren't anywhere
> near the overflow threshold.  Including the lower-order divisor terms
> could change the value of qdigit by a multiplicative factor of no more
> than about 1/NBASE.  Lastly, roundoff error in the floating-point part
> of the calculation could cause qdigit to be off by one count either
> way.  None of these effects are going to let the final div[qi+1] value
> get to more than two or three times NBASE squared, which is still
> an order of magnitude less than INT_MAX.
>

Right. In fact I think div[qi+1] is even more constrained than that (see below).


> Therefore, the final div[qi+1] value cannot overflow an int, and even
> though it might be larger than what maxdiv would suggest, it's not going
> to be large enough to cause overflow in the next loop iteration either.
> It obviously won't cause overflow during carry propagation, and as for
> the next divisor-subtraction step, we can do an analysis similar to the
> above but approximating qdigit with just these terms:
>         qdigit ~= trunc((div[qi] + div[qi+1]/NBASE) / var2digits[0])
> Plugging that into
>         div[qi]' = div[qi] - qdigit * var2digits[0]
> shows that the updated div[qi] in any loop iteration is going to be just
> about -div[qi+1]/NBASE, plus a truncation term that's between 0 and
> var2digits[0], plus lower-order terms that aren't going to get you very
> much past INT_MAX/NBASE.  So div[qi]' is never an overflow hazard in any
> loop iteration.
>
> In short, it's impossible for the end result of the div[qi+1] update
> calculation to overflow, or even to get large enough to create a problem
> in the next loop iteration.  However, the intermediate result
> div[qi]*NBASE can overflow as per the example I showed before.
>

Agreed.

I tried analysing this in a different way, by considering the maximum
possible error in the floating point computations. fdividend takes up
to 4 digits from the dividend:
 fdividend = div[qi] * NBASE^3 + div[qi+1] * NBASE^2           + div[qi+2] * NBASE + div[qi+3]

and the digits being discarded are limited to a little less than
+/-INT_MAX, so the error in fdividend due to discarding digits is at
most +/-INT_MAX/NBASE. fdivisor is computed in a similar way, except
that the var2 digits are in the range [0,NBASE-1], so fdivisor is in
the range [NBASE^3, NBASE^4-1]. Therefore the maximum error in the
computation of fquotient is +/-INT_MAX/NBASE^4, which is less than 1.
So the most this can do is to cause a single integer boundary to be
crossed, and therefore qdigit will be within +/-1 of the correct
result. That's consistent with the observation that in practice qdigit
always falls in the range [-1,10000].

In fact, since we round down when computing qdigit, it should always
be correct or 1 smaller than the correct result, but the "correct"
result in this context may be equal to 10000, to compensate for an
underestimate in the previously calculated digit.

It then follows that the result of the divisor subtraction step can be
to have up to one extra copy of the divisor added to the remainder,
and this will contribute up to an extra var2digits[0]*NBASE^4 +
var2digits[1]*NBASE^3 to fdividend in the next iteration, making
fdividend up to around NBASE*fdivisor.

So fdividend will always be less than around NBASE^5, which means that
at the top of the loop div[qi] is always less than
NBASE^2+INT_MAX/NBASE (since the next digit of the divisor could be
large and negative, down to -INT_MAX).

So, in other words, the limit on the value of div[qi+1] computed at
the bottom of the loop is around NBASE^2+INT_MAX/NBASE=100214748, and
so it can't possibly overflow a 32-bit integer.

I added some logging and ran the self tests plus a few other examples
analogous to your example upthread and the largest value for div[qi+1]
logged was 100210148, which is consistent with that bound.


> We could just ignore this on the grounds that it doesn't matter given sane
> behavior of integer arithmetic.  Or, if we wanted to be more paranoid, we
> could do the multiplication-and-addition in int64 arithmetic; but that
> would probably slow things down a bit.
>

I doubt that a single int64 computation would make any noticeable
performance difference to this computation, given all the other code
in the loop, but I think it's probably unnecessary.


> I think all this deserves some code comments, but I'm not sure whether
> we need to bother with casting to int64.  Thoughts?
>
>                         regards, tom lane

I think probably just a comment is OK.

Regards,
Dean



Re: Bug in numeric multiplication

From
Tom Lane
Date:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> On 24 November 2015 at 16:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ... None of these effects are going to let the final div[qi+1] value
>> get to more than two or three times NBASE squared, which is still
>> an order of magnitude less than INT_MAX.

> Right. In fact I think div[qi+1] is even more constrained than that (see below).

Thanks for the additional analysis and testing.

>> I think all this deserves some code comments, but I'm not sure whether
>> we need to bother with casting to int64.  Thoughts?

> I think probably just a comment is OK.

I'm inclined to agree.  The only reason I'd worry at all is that the
compiler guys are getting ever more aggressive about claiming that any
integer overflow can result in arbitrarily-broken behavior.  However,
we already have a lot of other code that depends on -fwrapv or equivalent
overflow behavior, so I doubt that this is the first place to worry
about it.  The underlying hardware is certainly going to behave as we
wish, and it's pretty hard to see how a compiler could "optimize" the
assembly code in a way that would break such a simple calculation.

I'll go add some comments and call it good.
        regards, tom lane