Thread: Avoid choose invalid number of partitions (src/backend/executor/nodeAgg.c)

Avoid choose invalid number of partitions (src/backend/executor/nodeAgg.c)

From
Ranier Vilela
Date:
Hi,

This one is not a Coverity finding.

The hash_choose_num_partitions function has an issue:
there are at least two call paths where it is invoked with used_bits = 0.
See hashagg_spill_init.

To confirm, I ran this code on cpp.sh:
#include <stdio.h>

int main(void)
{
    int npartitions = 0;     /* int, as in nodeAgg.c */
    int used_bits = 0;
    int partition_bits = 0;
    int i;

    for (i = 0; i <= 32; i++)
    {
        /* make sure that we don't exhaust the hash bits */
        if (partition_bits + used_bits >= 32)
            partition_bits = 32 - used_bits;
        npartitions = 1L << partition_bits;   /* truncates once the shift exceeds 30 bits */
        printf("used_bits=%d\n", used_bits);
        printf("partition_bits=%d\n", partition_bits);
        printf("npartitions=%d\n\n", npartitions);
        partition_bits++;
    }
    return 0;
}

The output is:
used_bits=0
partition_bits=0
npartitions=1

used_bits=0
partition_bits=1
npartitions=2

used_bits=0
partition_bits=2
npartitions=4

used_bits=0
partition_bits=3
npartitions=8

used_bits=0
partition_bits=4
npartitions=16

used_bits=0
partition_bits=5
npartitions=32

used_bits=0
partition_bits=6
npartitions=64

used_bits=0
partition_bits=7
npartitions=128

used_bits=0
partition_bits=8
npartitions=256

used_bits=0
partition_bits=9
npartitions=512

used_bits=0
partition_bits=10
npartitions=1024

used_bits=0
partition_bits=11
npartitions=2048

used_bits=0
partition_bits=12
npartitions=4096

used_bits=0
partition_bits=13
npartitions=8192

used_bits=0
partition_bits=14
npartitions=16384

used_bits=0
partition_bits=15
npartitions=32768

used_bits=0
partition_bits=16
npartitions=65536

used_bits=0
partition_bits=17
npartitions=131072

used_bits=0
partition_bits=18
npartitions=262144

used_bits=0
partition_bits=19
npartitions=524288

used_bits=0
partition_bits=20
npartitions=1048576

used_bits=0
partition_bits=21
npartitions=2097152

used_bits=0
partition_bits=22
npartitions=4194304

used_bits=0
partition_bits=23
npartitions=8388608

used_bits=0
partition_bits=24
npartitions=16777216

used_bits=0
partition_bits=25
npartitions=33554432

used_bits=0
partition_bits=26
npartitions=67108864

used_bits=0
partition_bits=27
npartitions=134217728

used_bits=0
partition_bits=28
npartitions=268435456

used_bits=0
partition_bits=29
npartitions=536870912

used_bits=0
partition_bits=30
npartitions=1073741824

used_bits=0
partition_bits=31
npartitions=-2147483648

used_bits=0
partition_bits=32
npartitions=0

partition_bits > 24 is already very questionable, but with 31 and 32 it becomes a bug.
With npartitions = -2147483648 or 0, the computation in hashagg_spill_init,
spill->mask = (npartitions - 1) << spill->shift;
involves signed overflow or a left shift of a negative value, both of which are undefined behavior under the rules of C.
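
For illustration only, here is a minimal, self-contained sketch of the kind of cap that would keep the shift in range. It is not the attached patch; the helper name and the 30-bit limit are assumptions made for this example:

#include <stdio.h>

/*
 * Sketch of a defensive cap on the partition count (hypothetical; not the
 * attached patch).  Leaving at least one hash bit unused keeps
 * 1 << partition_bits representable in a signed int, so a later
 * (npartitions - 1) << shift stays well defined.
 */
static int
choose_npartitions_sketch(int wanted_bits, int used_bits)
{
    int partition_bits = wanted_bits;

    if (partition_bits + used_bits > 30)
        partition_bits = 30 - used_bits;
    if (partition_bits < 0)
        partition_bits = 0;

    return 1 << partition_bits;
}

int main(void)
{
    /* with used_bits = 0 and 32 requested bits, the result stays positive */
    printf("%d\n", choose_npartitions_sketch(32, 0));   /* 1073741824 */
    printf("%d\n", choose_npartitions_sketch(10, 0));   /* 1024 */
    return 0;
}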

On Windows 64 bits (HEAD) fails with partition_prune:
parallel group (11 tests):  reloptions hash_part partition_info explain compression resultcache indexing partition_join partition_aggregate partition_prune tuplesort
     partition_join               ... ok         3495 ms
     partition_prune              ... FAILED     4926 ms

diff -w -U3 C:/dll/postgres/postgres_head/src/test/regress/expected/partition_prune.out C:/dll/postgres/postgres_head/src/test/regress/results/partition_prune.out
--- C:/dll/postgres/postgres_head/src/test/regress/expected/partition_prune.out 2021-06-23 11:11:26.489575100 -0300
+++ C:/dll/postgres/postgres_head/src/test/regress/results/partition_prune.out 2021-06-29 10:54:43.103775700 -0300
@@ -2660,7 +2660,7 @@
 --------------------------------------------------------------------------
  Nested Loop (actual rows=3 loops=1)
    ->  Seq Scan on tbl1 (actual rows=5 loops=1)
-   ->  Append (actual rows=1 loops=5)
+   ->  Append (actual rows=0 loops=5)
          ->  Index Scan using tprt1_idx on tprt_1 (never executed)
                Index Cond: (col1 = tbl1.col1)
          ->  Index Scan using tprt2_idx on tprt_2 (actual rows=1 loops=2)

With patch attached:
parallel group (11 tests):  partition_info hash_part resultcache reloptions explain compression indexing partition_aggregate partition_join tuplesort partition_prune
     partition_join               ... ok         3013 ms
     partition_prune              ... ok         3959 ms

regards,
Ranier Vilela
Attachment

Re: Avoid choose invalid number of partitions (src/backend/executor/nodeAgg.c)

From
David Rowley
Date:
On Wed, 30 Jun 2021 at 02:33, Ranier Vilela <ranier.vf@gmail.com> wrote:
> hash_choose_num_partitions function has issues.
> There are at least two path calls made with used_bits = 0.
> See at hashagg_spill_init.

> On Windows 64 bits (HEAD) fails with partition_prune:
> parallel group (11 tests):  reloptions hash_part partition_info explain compression resultcache indexing partition_join partition_aggregate partition_prune tuplesort
>      partition_join               ... ok         3495 ms
>      partition_prune              ... FAILED     4926 ms
>
> diff -w -U3 C:/dll/postgres/postgres_head/src/test/regress/expected/partition_prune.out C:/dll/postgres/postgres_head/src/test/regress/results/partition_prune.out
> --- C:/dll/postgres/postgres_head/src/test/regress/expected/partition_prune.out 2021-06-23 11:11:26.489575100 -0300
> +++ C:/dll/postgres/postgres_head/src/test/regress/results/partition_prune.out 2021-06-29 10:54:43.103775700 -0300
> @@ -2660,7 +2660,7 @@
>  --------------------------------------------------------------------------
>   Nested Loop (actual rows=3 loops=1)
>     ->  Seq Scan on tbl1 (actual rows=5 loops=1)
> -   ->  Append (actual rows=1 loops=5)
> +   ->  Append (actual rows=0 loops=5)
>           ->  Index Scan using tprt1_idx on tprt_1 (never executed)
>                 Index Cond: (col1 = tbl1.col1)
>           ->  Index Scan using tprt2_idx on tprt_2 (actual rows=1 loops=2)
>
> With patch attached:
> parallel group (11 tests):  partition_info hash_part resultcache reloptions explain compression indexing partition_aggregate partition_join tuplesort partition_prune
>      partition_join               ... ok         3013 ms
>      partition_prune              ... ok         3959 ms

This failure was reported to me along with this thread so I had a look at it.

Firstly, I'm a bit confused as to why you think making a change in
nodeAgg.c would have any effect on a plan that does not contain any
aggregate node.

As for the regression test failure. I can recreate it, but I did have
to install VS2019 version 16.9.3 from
https://docs.microsoft.com/en-us/visualstudio/releases/2019/history

This basically boils down to the 16.9.3 compiler outputting "0" for:

#include <stdio.h>

int main(void)
{
    printf("%.0f\n", 0.59999999999999998);
    return 0;
}

but we expect it to output "1".

We make use of the provided sprintf() function in snprintf.c line 1188 with:

vallen = sprintf(convert, fmt, prec, value);
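
For a standalone check of that same call shape, the precision can be passed through '*' the way that sprintf() call does (the exact format string built by snprintf.c is assumed here; only the '*' precision and the literal value mirror the line above):

#include <stdio.h>

int main(void)
{
    char buf[64];

    /* precision 0 passed via '*', then the double value */
    sprintf(buf, "%.*f", 0, 0.59999999999999998);
    printf("%s\n", buf);   /* expected "1"; the buggy compiler prints "0" */
    return 0;
}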

I don't see the problem in more recent versions of VS2019, but I
didn't go to the trouble of figuring out exactly which version this
was fixed in.

David



Re: Avoid choose invalid number of partitions (src/backend/executor/nodeAgg.c)

From
Ranier Vilela
Date:
On Mon, 30 Aug 2021 at 07:44, David Rowley <dgrowleyml@gmail.com> wrote:
On Wed, 30 Jun 2021 at 02:33, Ranier Vilela <ranier.vf@gmail.com> wrote:
> hash_choose_num_partitions function has issues.
> There are at least two path calls made with used_bits = 0.
> See at hashagg_spill_init.

> On Windows 64 bits (HEAD) fails with partition_prune:
> parallel group (11 tests):  reloptions hash_part partition_info explain compression resultcache indexing partition_join partition_aggregate partition_prune tuplesort
>      partition_join               ... ok         3495 ms
>      partition_prune              ... FAILED     4926 ms
>
> diff -w -U3 C:/dll/postgres/postgres_head/src/test/regress/expected/partition_prune.out C:/dll/postgres/postgres_head/src/test/regress/results/partition_prune.out
> --- C:/dll/postgres/postgres_head/src/test/regress/expected/partition_prune.out 2021-06-23 11:11:26.489575100 -0300
> +++ C:/dll/postgres/postgres_head/src/test/regress/results/partition_prune.out 2021-06-29 10:54:43.103775700 -0300
> @@ -2660,7 +2660,7 @@
>  --------------------------------------------------------------------------
>   Nested Loop (actual rows=3 loops=1)
>     ->  Seq Scan on tbl1 (actual rows=5 loops=1)
> -   ->  Append (actual rows=1 loops=5)
> +   ->  Append (actual rows=0 loops=5)
>           ->  Index Scan using tprt1_idx on tprt_1 (never executed)
>                 Index Cond: (col1 = tbl1.col1)
>           ->  Index Scan using tprt2_idx on tprt_2 (actual rows=1 loops=2)
>
> With patch attached:
> parallel group (11 tests):  partition_info hash_part resultcache reloptions explain compression indexing partition_aggregate partition_join tuplesort partition_prune
>      partition_join               ... ok         3013 ms
>      partition_prune              ... ok         3959 ms

This failure was reported to me along with this thread so I had a look at it.
Thanks.


Firstly, I'm a bit confused as to why you think making a change in
nodeAgg.c would have any effect on a plan that does not contain any
aggregate node.
Yeah, they are unrelated.
For some reason, when I checked the regression tests, partition_prune was passing, and I mistakenly connected that with my changes, which was wrong.


As for the regression test failure. I can recreate it, but I did have
to install VS2019 version 16.9.3 from
https://docs.microsoft.com/en-us/visualstudio/releases/2019/history

This basically boils down to the 16.9.3 compiler outputting "0" for:

#include <stdio.h>

int main(void)
{
    printf("%.0f\n", 0.59999999999999998);
    return 0;
}

but we expect it to output "1".

We make use of the provided sprintf() function in snprintf.c line 1188 with:

vallen = sprintf(convert, fmt, prec, value);

I don't see the problem in more recent versions of VS2019, but I
didn't go to the trouble of figuring out exactly which version this
was fixed in.
Regarding this test: with the latest MSVC, compiled for Debug, it still occurs:
     partition_join               ... ok         4267 ms
     partition_prune              ... FAILED     5270 ms
     reloptions                   ... ok          755 ms
     hash_part                    ... ok          494 ms

I still believe it's a compiler problem.

regards,
Ranier Vilela