Thread: Avoid choose invalid number of partitions (src/backend/executor/nodeAgg.c)
Hi,
Not per Coverity.
hash_choose_num_partitions function has issues.
There are at least two path calls made with used_bits = 0.
See at hashagg_spill_init.
See at hashagg_spill_init.
To confirm, I run this code on cpp.sh:
int main()
{
int npartitions = 0;
int used_bits = 0;
int partition_bits = 0;
int i;
for(i = 0; i <= 32; i++) {
/* make sure that we don't exhaust the hash bits */
if (partition_bits + used_bits >= 32)
partition_bits = 32 - used_bits;
npartitions = 1L << partition_bits;
printf("used_bits=%d\n", used_bits);
printf("partition_bits=%d\n", partition_bits);
printf("npartitions=%d\n\n", npartitions);
partition_bits++;
}
}
{
int npartitions = 0;
int used_bits = 0;
int partition_bits = 0;
int i;
for(i = 0; i <= 32; i++) {
/* make sure that we don't exhaust the hash bits */
if (partition_bits + used_bits >= 32)
partition_bits = 32 - used_bits;
npartitions = 1L << partition_bits;
printf("used_bits=%d\n", used_bits);
printf("partition_bits=%d\n", partition_bits);
printf("npartitions=%d\n\n", npartitions);
partition_bits++;
}
}
Whose output would be:
used_bits=0
partition_bits=0
npartitions=1
used_bits=0
partition_bits=1
npartitions=2
used_bits=0
partition_bits=2
npartitions=4
used_bits=0
partition_bits=3
npartitions=8
used_bits=0
partition_bits=4
npartitions=16
used_bits=0
partition_bits=5
npartitions=32
used_bits=0
partition_bits=6
npartitions=64
used_bits=0
partition_bits=7
npartitions=128
used_bits=0
partition_bits=8
npartitions=256
used_bits=0
partition_bits=9
npartitions=512
used_bits=0
partition_bits=10
npartitions=1024
used_bits=0
partition_bits=11
npartitions=2048
used_bits=0
partition_bits=12
npartitions=4096
used_bits=0
partition_bits=13
npartitions=8192
used_bits=0
partition_bits=14
npartitions=16384
used_bits=0
partition_bits=15
npartitions=32768
used_bits=0
partition_bits=16
npartitions=65536
used_bits=0
partition_bits=17
npartitions=131072
used_bits=0
partition_bits=18
npartitions=262144
used_bits=0
partition_bits=19
npartitions=524288
used_bits=0
partition_bits=20
npartitions=1048576
used_bits=0
partition_bits=21
npartitions=2097152
used_bits=0
partition_bits=22
npartitions=4194304
used_bits=0
partition_bits=23
npartitions=8388608
used_bits=0
partition_bits=24
npartitions=16777216
used_bits=0
partition_bits=25
npartitions=33554432
used_bits=0
partition_bits=26
npartitions=67108864
used_bits=0
partition_bits=27
npartitions=134217728
used_bits=0
partition_bits=28
npartitions=268435456
used_bits=0
partition_bits=29
npartitions=536870912
used_bits=0
partition_bits=30
npartitions=1073741824
used_bits=0
partition_bits=31
npartitions=-2147483648
used_bits=0
partition_bits=32
npartitions=0
partition_bits=0
npartitions=1
used_bits=0
partition_bits=1
npartitions=2
used_bits=0
partition_bits=2
npartitions=4
used_bits=0
partition_bits=3
npartitions=8
used_bits=0
partition_bits=4
npartitions=16
used_bits=0
partition_bits=5
npartitions=32
used_bits=0
partition_bits=6
npartitions=64
used_bits=0
partition_bits=7
npartitions=128
used_bits=0
partition_bits=8
npartitions=256
used_bits=0
partition_bits=9
npartitions=512
used_bits=0
partition_bits=10
npartitions=1024
used_bits=0
partition_bits=11
npartitions=2048
used_bits=0
partition_bits=12
npartitions=4096
used_bits=0
partition_bits=13
npartitions=8192
used_bits=0
partition_bits=14
npartitions=16384
used_bits=0
partition_bits=15
npartitions=32768
used_bits=0
partition_bits=16
npartitions=65536
used_bits=0
partition_bits=17
npartitions=131072
used_bits=0
partition_bits=18
npartitions=262144
used_bits=0
partition_bits=19
npartitions=524288
used_bits=0
partition_bits=20
npartitions=1048576
used_bits=0
partition_bits=21
npartitions=2097152
used_bits=0
partition_bits=22
npartitions=4194304
used_bits=0
partition_bits=23
npartitions=8388608
used_bits=0
partition_bits=24
npartitions=16777216
used_bits=0
partition_bits=25
npartitions=33554432
used_bits=0
partition_bits=26
npartitions=67108864
used_bits=0
partition_bits=27
npartitions=134217728
used_bits=0
partition_bits=28
npartitions=268435456
used_bits=0
partition_bits=29
npartitions=536870912
used_bits=0
partition_bits=30
npartitions=1073741824
used_bits=0
partition_bits=31
npartitions=-2147483648
used_bits=0
partition_bits=32
npartitions=0
With partition_bits > 24, is very problematic, but with 31 and 32, it becomes a bug.
With npartitions = -2147483648 and 0, the function hashagg_spill_init,
will generate an operation that is undefined according to the rules of C.
spill->mask = (npartitions - 1) << spill->shift;
spill->mask = (npartitions - 1) << spill->shift;
On Windows 64 bits (HEAD) fails with partition_prune:
parallel group (11 tests): reloptions hash_part partition_info explain compression resultcache indexing partition_join partition_aggregate partition_prune tuplesort
partition_join ... ok 3495 ms
partition_prune ... FAILED 4926 ms
partition_join ... ok 3495 ms
partition_prune ... FAILED 4926 ms
diff -w -U3 C:/dll/postgres/postgres_head/src/test/regress/expected/partition_prune.out C:/dll/postgres/postgres_head/src/test/regress/results/partition_prune.out
--- C:/dll/postgres/postgres_head/src/test/regress/expected/partition_prune.out 2021-06-23 11:11:26.489575100 -0300
+++ C:/dll/postgres/postgres_head/src/test/regress/results/partition_prune.out 2021-06-29 10:54:43.103775700 -0300
@@ -2660,7 +2660,7 @@
--------------------------------------------------------------------------
Nested Loop (actual rows=3 loops=1)
-> Seq Scan on tbl1 (actual rows=5 loops=1)
- -> Append (actual rows=1 loops=5)
+ -> Append (actual rows=0 loops=5)
-> Index Scan using tprt1_idx on tprt_1 (never executed)
Index Cond: (col1 = tbl1.col1)
-> Index Scan using tprt2_idx on tprt_2 (actual rows=1 loops=2)
--- C:/dll/postgres/postgres_head/src/test/regress/expected/partition_prune.out 2021-06-23 11:11:26.489575100 -0300
+++ C:/dll/postgres/postgres_head/src/test/regress/results/partition_prune.out 2021-06-29 10:54:43.103775700 -0300
@@ -2660,7 +2660,7 @@
--------------------------------------------------------------------------
Nested Loop (actual rows=3 loops=1)
-> Seq Scan on tbl1 (actual rows=5 loops=1)
- -> Append (actual rows=1 loops=5)
+ -> Append (actual rows=0 loops=5)
-> Index Scan using tprt1_idx on tprt_1 (never executed)
Index Cond: (col1 = tbl1.col1)
-> Index Scan using tprt2_idx on tprt_2 (actual rows=1 loops=2)
With patch attached:
parallel group (11 tests): partition_info hash_part resultcache reloptions explain compression indexing partition_aggregate partition_join tuplesort partition_prune
partition_join ... ok 3013 ms
partition_prune ... ok 3959 ms
partition_join ... ok 3013 ms
partition_prune ... ok 3959 ms
regards,
Ranier Vilela
Attachment
Re: Avoid choose invalid number of partitions (src/backend/executor/nodeAgg.c)
From
David Rowley
Date:
On Wed, 30 Jun 2021 at 02:33, Ranier Vilela <ranier.vf@gmail.com> wrote: > hash_choose_num_partitions function has issues. > There are at least two path calls made with used_bits = 0. > See at hashagg_spill_init. > On Windows 64 bits (HEAD) fails with partition_prune: > parallel group (11 tests): reloptions hash_part partition_info explain compression resultcache indexing partition_joinpartition_aggregate partition_prune tuplesort > partition_join ... ok 3495 ms > partition_prune ... FAILED 4926 ms > > diff -w -U3 C:/dll/postgres/postgres_head/src/test/regress/expected/partition_prune.out C:/dll/postgres/postgres_head/src/test/regress/results/partition_prune.out > --- C:/dll/postgres/postgres_head/src/test/regress/expected/partition_prune.out 2021-06-23 11:11:26.489575100 -0300 > +++ C:/dll/postgres/postgres_head/src/test/regress/results/partition_prune.out 2021-06-29 10:54:43.103775700 -0300 > @@ -2660,7 +2660,7 @@ > -------------------------------------------------------------------------- > Nested Loop (actual rows=3 loops=1) > -> Seq Scan on tbl1 (actual rows=5 loops=1) > - -> Append (actual rows=1 loops=5) > + -> Append (actual rows=0 loops=5) > -> Index Scan using tprt1_idx on tprt_1 (never executed) > Index Cond: (col1 = tbl1.col1) > -> Index Scan using tprt2_idx on tprt_2 (actual rows=1 loops=2) > > With patch attached: > parallel group (11 tests): partition_info hash_part resultcache reloptions explain compression indexing partition_aggregatepartition_join tuplesort partition_prune > partition_join ... ok 3013 ms > partition_prune ... ok 3959 ms This failure was reported to me along with this thread so I had a look at it. Firstly, I'm a bit confused as to why you think making a change in nodeAgg.c would have any effect on a plan that does not contain any aggregate node. As for the regression test failure. I can recreate it, but I did have to install VS2019 version 16.9.3 from https://docs.microsoft.com/en-us/visualstudio/releases/2019/history This basically boils down to the 16.9.3 compiler outputting "0" for: #include <stdio.h> int main(void) { printf("%.0f\n", 0.59999999999999998); return 0; } but we expect it to output "1". We name use of the provided sprintf() function in snprintf.c line 1188 with: vallen = sprintf(convert, fmt, prec, value); I don't see the problem in more recent versions of VS2019, but I didn't go to the trouble of figuring out exactly which version this was fixed in. David
Re: Avoid choose invalid number of partitions (src/backend/executor/nodeAgg.c)
From
Ranier Vilela
Date:
Em seg., 30 de ago. de 2021 às 07:44, David Rowley <dgrowleyml@gmail.com> escreveu:
On Wed, 30 Jun 2021 at 02:33, Ranier Vilela <ranier.vf@gmail.com> wrote:
> hash_choose_num_partitions function has issues.
> There are at least two path calls made with used_bits = 0.
> See at hashagg_spill_init.
> On Windows 64 bits (HEAD) fails with partition_prune:
> parallel group (11 tests): reloptions hash_part partition_info explain compression resultcache indexing partition_join partition_aggregate partition_prune tuplesort
> partition_join ... ok 3495 ms
> partition_prune ... FAILED 4926 ms
>
> diff -w -U3 C:/dll/postgres/postgres_head/src/test/regress/expected/partition_prune.out C:/dll/postgres/postgres_head/src/test/regress/results/partition_prune.out
> --- C:/dll/postgres/postgres_head/src/test/regress/expected/partition_prune.out 2021-06-23 11:11:26.489575100 -0300
> +++ C:/dll/postgres/postgres_head/src/test/regress/results/partition_prune.out 2021-06-29 10:54:43.103775700 -0300
> @@ -2660,7 +2660,7 @@
> --------------------------------------------------------------------------
> Nested Loop (actual rows=3 loops=1)
> -> Seq Scan on tbl1 (actual rows=5 loops=1)
> - -> Append (actual rows=1 loops=5)
> + -> Append (actual rows=0 loops=5)
> -> Index Scan using tprt1_idx on tprt_1 (never executed)
> Index Cond: (col1 = tbl1.col1)
> -> Index Scan using tprt2_idx on tprt_2 (actual rows=1 loops=2)
>
> With patch attached:
> parallel group (11 tests): partition_info hash_part resultcache reloptions explain compression indexing partition_aggregate partition_join tuplesort partition_prune
> partition_join ... ok 3013 ms
> partition_prune ... ok 3959 ms
This failure was reported to me along with this thread so I had a look at it.
Thanks.
Firstly, I'm a bit confused as to why you think making a change in
nodeAgg.c would have any effect on a plan that does not contain any
aggregate node.
Yeah, they are unrelated.
For some reason, when checking the regress, partion_prune was ok and I mistakenly made a connection with the changes, which is wrong.
As for the regression test failure. I can recreate it, but I did have
to install VS2019 version 16.9.3 from
https://docs.microsoft.com/en-us/visualstudio/releases/2019/history
This basically boils down to the 16.9.3 compiler outputting "0" for:
#include <stdio.h>
int main(void)
{
printf("%.0f\n", 0.59999999999999998);
return 0;
}
but we expect it to output "1".
We name use of the provided sprintf() function in snprintf.c line 1188 with:
vallen = sprintf(convert, fmt, prec, value);
I don't see the problem in more recent versions of VS2019, but I
didn't go to the trouble of figuring out exactly which version this
was fixed in.
Regarding this test, with the last msvc, compiled for Debug, it still occurs:
partition_join ... ok 4267 ms
partition_prune ... FAILED 5270 ms
reloptions ... ok 755 ms
hash_part ... ok 494 ms
partition_prune ... FAILED 5270 ms
reloptions ... ok 755 ms
hash_part ... ok 494 ms
I still believe it's the compiler problem.
regards,
Ranier Vilela