Thread: Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows

Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows

From
Kouhei Kaigai
Date:
Hello,

I noticed that ExecChooseHashTableSize() in nodeHash.c fails on
Assert(nbuckets > 0) when a crazy number of rows is expected.

BACKTRACE:

#0  0x0000003f79432625 in raise () from /lib64/libc.so.6
#1  0x0000003f79433e05 in abort () from /lib64/libc.so.6
#2  0x000000000092600a in ExceptionalCondition (conditionName=0xac1ea0 "!(nbuckets > 0)",
    errorType=0xac1d88 "FailedAssertion", fileName=0xac1d40 "nodeHash.c", lineNumber=545) at assert.c:54
#3  0x00000000006851ff in ExecChooseHashTableSize (ntuples=60521928028, tupwidth=8, useskew=1 '\001',
    numbuckets=0x7fff146bff04, numbatches=0x7fff146bff00, num_skew_mcvs=0x7fff146bfefc) at nodeHash.c:545
#4  0x0000000000701735 in initial_cost_hashjoin (root=0x253a318, workspace=0x7fff146bffc0, jointype=JOIN_SEMI,
    hashclauses=0x257e4f0, outer_path=0x2569a40, inner_path=0x2569908, sjinfo=0x2566f40, semifactors=0x7fff146c0168)
    at costsize.c:2592
#5  0x000000000070e02a in try_hashjoin_path (root=0x253a318, joinrel=0x257d940, outer_path=0x2569a40,
inner_path=0x2569908,
    hashclauses=0x257e4f0, jointype=JOIN_SEMI, extra=0x7fff146c0150) at joinpath.c:543


See the following EXPLAIN output, taken from a build configured without --enable-cassert.
The planner expects 60.5B rows for the self join of a relation with 72M rows.
(This estimation is probably far too high.)

[kaigai@ayu ~]$ (echo EXPLAIN; cat ~/tpcds/query95.sql) | psql tpcds100
                                                                                       QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9168667273.07..9168667273.08 rows=1 width=20)
   CTE ws_wh
     ->  Custom Scan (GpuJoin)  (cost=3342534.49..654642911.88 rows=60521928028 width=24)
           Bulkload: On (density: 100.00%)
           Depth 1: Logic: GpuHashJoin, HashKeys: (ws_order_number), JoinQual: ((ws_warehouse_sk <> ws_warehouse_sk) AND (ws_order_number = ws_order_number)), nrows (ratio: 84056.77%)
           ->  Custom Scan (BulkScan) on web_sales ws1_1  (cost=0.00..3290612.48 rows=72001248 width=16)
           ->  Seq Scan on web_sales ws2  (cost=0.00..3290612.48 rows=72001248 width=16)
   ->  Sort  (cost=8514024361.19..8514024361.20 rows=1 width=20)
         Sort Key: (count(DISTINCT ws1.ws_order_number))
      :

This crash was triggered by Assert(nbuckets > 0), and nbuckets is calculated
as follows.

    /*
     * If there's not enough space to store the projected number of tuples and
     * the required bucket headers, we will need multiple batches.
     */
    if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
    {
        /* We'll need multiple batches */
        long        lbuckets;
        double      dbatch;
        int         minbatch;
        long        bucket_size;

        /*
         * Estimate the number of buckets we'll want to have when work_mem is
         * entirely full.  Each bucket will contain a bucket pointer plus
         * NTUP_PER_BUCKET tuples, whose projected size already includes
         * overhead for the hash code, pointer to the next tuple, etc.
         */
        bucket_size = (tupsize * NTUP_PER_BUCKET + sizeof(HashJoinTuple));
        lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);
        lbuckets = Min(lbuckets, max_pointers);
        nbuckets = (int) lbuckets;
        bucket_bytes = nbuckets * sizeof(HashJoinTuple);
          :
          :
    }
    Assert(nbuckets > 0);
    Assert(nbatch > 0);

In my case, hash_table_bytes was 101017630802 and bucket_size was 48.
So my_log2(hash_table_bytes / bucket_size) = 31, and lbuckets becomes a
negative number because both the literal "1" and the result of my_log2()
are int32. Min(lbuckets, max_pointers) therefore chooses 0x80000000,
which is assigned to nbuckets and triggers the Assert().
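
For illustration, here is a minimal standalone C sketch of the failure (not
PostgreSQL code; the my_log2() result is hard-coded as 31, and an LP64 build
with 32-bit int and 64-bit long is assumed):

    #include <stdio.h>

    int main(void)
    {
        long hash_table_bytes = 101017630802L;  /* from this report (~94GB)       */
        long bucket_size = 48;                  /* tupsize * NTUP_PER_BUCKET + ptr */
        int  shift = 31;                        /* my_log2(hash_table_bytes / bucket_size) */

        /* int-typed shift: the value wraps into the sign bit before being
         * widened to long (technically undefined behaviour; observed here
         * as -2147483648). */
        long broken = 1 << shift;

        /* long-typed shift: the intermediate is 64 bits wide, so we get the
         * intended 2147483648. */
        long fixed = 1L << shift;

        printf("hash_table_bytes / bucket_size = %ld\n",
               hash_table_bytes / bucket_size);          /* 2104533975 */
        printf("1  << 31 -> %ld\n", broken);
        printf("1L << 31 -> %ld\n", fixed);
        return 0;
    }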

Attached patch fixes the problem.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


Attachment

Re: Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows

From
Kevin Grittner
Date:
Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

>         long        lbuckets;

>         lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);

>     Assert(nbuckets > 0);

> In my case, hash_table_bytes was 101017630802 and bucket_size was 48.
> So my_log2(hash_table_bytes / bucket_size) = 31, and lbuckets becomes a
> negative number because both the literal "1" and the result of my_log2()
> are int32. Min(lbuckets, max_pointers) therefore chooses 0x80000000,
> which is assigned to nbuckets and triggers the Assert().

> Attached patch fixes the problem.

So you changed the literal of 1 to 1U, but doesn't that just double
the threshold for failure?  Wouldn't 1L (to match the definition of
lbuckets) be better?
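
As a quick illustration of the difference between the literal choices
(assuming 32-bit int and 64-bit long), the largest shift count that still
yields a positive value once the result lands in a long is:

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        /* With the plain literal 1 the intermediate is int, so anything past
         * bit 30 reaches the sign bit.  1U buys exactly one more bit before
         * the 32-bit unsigned intermediate runs out -- hence "doubles the
         * threshold".  1L matches lbuckets and is good up to bit 62. */
        printf("1  : safe up to shift %d\n", (int) (sizeof(int) * CHAR_BIT) - 2);      /* 30 */
        printf("1U : safe up to shift %d\n", (int) (sizeof(unsigned) * CHAR_BIT) - 1); /* 31 */
        printf("1L : safe up to shift %d\n", (int) (sizeof(long) * CHAR_BIT) - 2);     /* 62 */
        return 0;
    }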

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



> Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> 
> >         long        lbuckets;
> 
> >         lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);
> 
> >     Assert(nbuckets > 0);
> 
> > In my case, hash_table_bytes was 101017630802 and bucket_size was 48.
> > So my_log2(hash_table_bytes / bucket_size) = 31, and lbuckets becomes a
> > negative number because both the literal "1" and the result of my_log2()
> > are int32. Min(lbuckets, max_pointers) therefore chooses 0x80000000,
> > which is assigned to nbuckets and triggers the Assert().
> 
> > Attached patch fixes the problem.
> 
> So you changed the literal of 1 to 1U, but doesn't that just double
> the threshold for failure?  Wouldn't 1L (to match the definition of
> lbuckets) be better?
>
Ah, yes, that was my careless mistake.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


On 19 August 2015 at 08:54, Kevin Grittner <kgrittn@ymail.com> wrote:
Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

>         long        lbuckets;

>         lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);

>     Assert(nbuckets > 0);

> In my case, hash_table_bytes was 101017630802 and bucket_size was 48.
> So my_log2(hash_table_bytes / bucket_size) = 31, and lbuckets becomes a
> negative number because both the literal "1" and the result of my_log2()
> are int32. Min(lbuckets, max_pointers) therefore chooses 0x80000000,
> which is assigned to nbuckets and triggers the Assert().

> Attached patch fixes the problem.

So you changed the literal of 1 to 1U, but doesn't that just double
the threshold for failure?  Wouldn't 1L (to match the definition of
lbuckets) be better?


I agree, but I can only imagine this is happening because the maximum setting of work_mem has been modified in the code you're running.

hash_table_bytes is set based on work_mem:

hash_table_bytes = work_mem * 1024L;

The size of your hash table is 101017630802 bytes, which is:

david=# select pg_size_pretty(101017630802);
 pg_size_pretty
----------------
 94 GB
(1 row)

david=# set work_mem = '94GB';
ERROR:  98566144 is outside the valid range for parameter "work_mem" (64 .. 2097151)

So I think the only way the following could cause an error is if bucket_size were 1, which it can't be.

lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);
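
To spell out that arithmetic (a rough sketch, assuming work_mem really is
capped at 2097151 kB):

    #include <stdio.h>

    int main(void)
    {
        long work_mem_kb = 2097151L;                 /* assumed GUC ceiling          */
        long hash_table_bytes = work_mem_kb * 1024L; /* ~2GB                         */
        long bucket_size = 48;                       /* smallest value in the report */

        /* The ratio is about 44.7 million, whose ceil(log2) is 26 -- nowhere
         * near the 31 needed to overflow a 32-bit shift. */
        printf("hash_table_bytes = %ld\n", hash_table_bytes);               /* 2147482624 */
        printf("ratio            = %ld\n", hash_table_bytes / bucket_size);
        return 0;
    }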

I think one day soon we'll need to allow larger work_mem sizes, but I think there's lots more to do than this change.

Regards

David Rowley

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
 
> -----Original Message-----
> From: David Rowley [mailto:david.rowley@2ndquadrant.com]
> Sent: Wednesday, August 19, 2015 9:00 AM
> To: Kevin Grittner
> Cc: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with
> crazy number of rows
> 
> On 19 August 2015 at 08:54, Kevin Grittner <kgrittn@ymail.com> wrote:
> 
> 
>     Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> 
>     >         long        lbuckets;
> 
>     >         lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);
> 
>     >     Assert(nbuckets > 0);
> 
>     > In my case, hash_table_bytes was 101017630802 and bucket_size was 48.
>     > So my_log2(hash_table_bytes / bucket_size) = 31, and lbuckets becomes a
>     > negative number because both the literal "1" and the result of my_log2()
>     > are int32. Min(lbuckets, max_pointers) therefore chooses 0x80000000,
>     > which is assigned to nbuckets and triggers the Assert().
> 
>     > Attached patch fixes the problem.
> 
>     So you changed the literal of 1 to 1U, but doesn't that just double
>     the threshold for failure?  Wouldn't 1L (to match the definition of
>     lbuckets) be better?
> 
> 
> 
> 
> I agree, but I can only imagine this is happening because the maximum setting
> of work_mem has been modified in the code you're running.
> 
> hash_table_bytes is set based on work_mem:
> 
> hash_table_bytes = work_mem * 1024L;
> 
> The size of your hash table is 101017630802 bytes, which is:
> 
> david=# select pg_size_pretty(101017630802);
> 
>  pg_size_pretty
> ----------------
>  94 GB
> (1 row)
> 
> david=# set work_mem = '94GB';
> ERROR:  98566144 is outside the valid range for parameter "work_mem" (64 ..
> 2097151)
>
Hmm, then why could I set work_mem = '96GB' without an error?

It is set in postgresql.conf.

  postgres=# SHOW work_mem;
   work_mem
  ----------
   96GB
  (1 row)

> So I think the only way the following could cause an error, is if bucket_size
> was 1, which it can't be.
> 
> lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);
> 
> 
> I think one day soon we'll need to allow larger work_mem sizes, but I think there's
> lots more to do than this change.
>
I overlooked this limitation, but why can I bypass the GUC range check?

--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

On 19 August 2015 at 12:23, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> -----Original Message-----
> From: David Rowley [mailto:david.rowley@2ndquadrant.com]
> Sent: Wednesday, August 19, 2015 9:00 AM
> The size of your hash table is 101017630802 bytes, which is:
>
> david=# select pg_size_pretty(101017630802);
>
>  pg_size_pretty
> ----------------
>  94 GB
> (1 row)
>
> david=# set work_mem = '94GB';
> ERROR:  98566144 is outside the valid range for parameter "work_mem" (64 ..
> 2097151)
>
Hmm, then why could I set work_mem = '96GB' without an error?

It is set in postgresql.conf.

  postgres=# SHOW work_mem;
   work_mem
  ----------
   96GB
  (1 row)

> So I think the only way the following could cause an error, is if bucket_size
> was 1, which it can't be.
>
> lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);
>
>
> I think one day soon we'll need to allow larger work_mem sizes, but I think there's
> lots more to do than this change.
>
I overlooked this limitation, but why can I bypass the GUC range check?
 
I'm unable to get the server to start if I set work_mem that big. I also tried starting the server with 1GB of work_mem and then doing a pg_ctl reload. In each case I get the same error message I would have gotten from SET work_mem = '96GB';

Which version are you running?

Are you sure there are no changes in guc.c for work_mem's range?

Regards

David Rowley

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

David Rowley <david.rowley@2ndquadrant.com> writes:
> david=# set work_mem = '94GB';
> ERROR:  98566144 is outside the valid range for parameter "work_mem" (64 ..
> 2097151)

Apparently you're testing on a 32-bit server.  64-bit servers allow
work_mem to go up to INT_MAX kilobytes.
        regards, tom lane



On 19 August 2015 at 12:38, Tom Lane <tgl@sss.pgh.pa.us> wrote:
David Rowley <david.rowley@2ndquadrant.com> writes:
> david=# set work_mem = '94GB';
> ERROR:  98566144 is outside the valid range for parameter "work_mem" (64 ..
> 2097151)

Apparently you're testing on a 32-bit server.  64-bit servers allow
work_mem to go up to INT_MAX kilobytes.

Hmm, no, but it does appear that sizeof(long) is 4 bytes for me, despite a 64-bit OS.

I guess this accounts for my misunderstanding that work_mem was limited to 2GB even on 64-bit machines.

From guc.h

#if SIZEOF_SIZE_T > 4 && SIZEOF_LONG > 4
#define MAX_KILOBYTES INT_MAX
#else
#define MAX_KILOBYTES (INT_MAX / 1024)
#endif 
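
So on a build where SIZEOF_LONG > 4, MAX_KILOBYTES is INT_MAX and
hash_table_bytes can reach INT_MAX * 1024L -- roughly 2TB -- which pushes the
shift count well past 31. A rough sketch (my_log2() approximated here with
the same ceiling-log2 loop it uses):

    #include <limits.h>
    #include <stdio.h>

    /* Approximation of my_log2(): smallest i such that 2^i >= num. */
    static int
    approx_log2_ceil(long num)
    {
        int  i = 0;
        long limit;

        for (limit = 1; limit < num && limit > 0; limit <<= 1)
            i++;
        return i;
    }

    int main(void)
    {
        long hash_table_bytes = (long) INT_MAX * 1024L; /* LP64 work_mem maximum */
        long bucket_size = 48;

        /* Comes out around 36 -- comfortably past 31, so the int-typed shift
         * can overflow on such builds. */
        printf("ceil(log2(hash_table_bytes / bucket_size)) = %d\n",
               approx_log2_ceil(hash_table_bytes / bucket_size));
        return 0;
    }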

Apologies for the noise.

I reported a similar issue before.

* Re: DBT-3 with SF=20 got failed
http://www.postgresql.org/message-id/557A19D1.9050107@2ndquadrant.com

I didn't get a server crash at that time; instead, palloc() complained
about a request size >= 1GB.
So we may need a general overhaul around HashJoin to support large
data sizes, not only nbuckets around 0x80000000.

Also, we may need to pay attention to the reliability of the planner's
scale estimation.
Even though the plan below says the join generates 60521928028 rows,
it actually generates 776157676 rows (about 1.3% of the estimate).


tpcds100=# EXPLAIN ANALYZE
             select ws1.ws_order_number, ws1.ws_warehouse_sk wh1, ws2.ws_warehouse_sk wh2
               from web_sales ws1, web_sales ws2
              where ws1.ws_order_number = ws2.ws_order_number
                and ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk;
                                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=25374644.08..1160509591.61 rows=60521928028 width=24) (actual time=138347.979..491889.343 rows=776157676 loops=1)
   Merge Cond: (ws1.ws_order_number = ws2.ws_order_number)
   Join Filter: (ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
   Rows Removed by Join Filter: 127853313
   ->  Sort  (cost=12687322.04..12867325.16 rows=72001248 width=16) (actual time=73252.300..79017.420 rows=72001237 loops=1)
         Sort Key: ws1.ws_order_number
         Sort Method: quicksort  Memory: 7083296kB
         ->  Seq Scan on web_sales ws1  (cost=0.00..3290612.48 rows=72001248 width=16) (actual time=0.023..39951.201 rows=72001237 loops=1)
   ->  Sort  (cost=12687322.04..12867325.16 rows=72001248 width=16) (actual time=65095.655..128885.811 rows=904010978 loops=1)
         Sort Key: ws2.ws_order_number
         Sort Method: quicksort  Memory: 7083296kB
         ->  Seq Scan on web_sales ws2  (cost=0.00..3290612.48 rows=72001248 width=16) (actual time=0.014..31046.888 rows=72001237 loops=1)
 Planning time: 0.232 ms
 Execution time: 530176.521 ms
(14 rows)


So, even if we allow nodeHash.c to allocate hash buckets larger than
1GB, the initial size should be determined carefully.
Probably 1GB is a good starting point, even if it gets expanded later.
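
A hypothetical sketch of that kind of clamp (the 1GB ceiling and the helper
name are my own illustration, not the committed behaviour): keep the initial
bucket array under the allocation limit and leave further growth to the
runtime resize logic.

    #include <stddef.h>
    #include <stdio.h>

    typedef void *HashJoinTuple;    /* stand-in for the real bucket-pointer type */

    #define ASSUMED_MAX_BUCKET_BYTES (1024L * 1024L * 1024L)  /* assumed 1GB ceiling */

    /* Halve the (power-of-two) bucket count until the bucket array fits
     * under the assumed ceiling; nodeHash.c could enlarge it later if the
     * row estimate turns out to be real. */
    static long
    clamp_initial_nbuckets(long lbuckets)
    {
        long max_buckets = ASSUMED_MAX_BUCKET_BYTES / sizeof(HashJoinTuple);

        while (lbuckets > max_buckets)
            lbuckets >>= 1;
        return lbuckets;
    }

    int main(void)
    {
        /* The case above: 1L << 31 buckets would need 16GB of bucket
         * headers alone; the clamp brings it down to 2^27. */
        printf("clamped nbuckets = %ld\n", clamp_initial_nbuckets(1L << 31));
        return 0;
    }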

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of David Rowley
> Sent: Wednesday, August 19, 2015 10:07 AM
> To: Tom Lane
> Cc: Kevin Grittner; Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with
> crazy number of rows
> 
> On 19 August 2015 at 12:38, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 
> 
>     David Rowley <david.rowley@2ndquadrant.com> writes:
>     > david=# set work_mem = '94GB';
>     > ERROR:  98566144 is outside the valid range for parameter "work_mem"
> (64 ..
>     > 2097151)
> 
>     Apparently you're testing on a 32-bit server.  64-bit servers allow
>     work_mem to go up to INT_MAX kilobytes.
> 
> 
> 
> hmm, no, but it does appear that sizeof(long) is 4 bytes for me, despite 64 bit
> OS.
> 
> I guess this accounts for my misunderstanding that work_mem was limited to 2GB
> even on 64 bit machines.
> 
> From guc.h
> 
> #if SIZEOF_SIZE_T > 4 && SIZEOF_LONG > 4
> #define MAX_KILOBYTES INT_MAX
> #else
> #define MAX_KILOBYTES (INT_MAX / 1024)
> #endif
> 
> Apologies for the noise.


Re: Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows

From
Kevin Grittner
Date:
Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

> we may need a general overhaul around HashJoin to support large
> data sizes, not only nbuckets around 0x80000000.

Perhaps, but this is a clear bug, introduced to the 9.5 code, with
an obvious fix; so I've pushed the change from 1 to 1L on that left
shift.  There was clearly an attempt in surrounding code to deal
with size overruns by using a long, applying a min(), and casting
to int; this one statement just missed a trick.  If we need to
further constrain sizes to keep within allocation limits, that
seems like an entirely separate patch.
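
For reference, the statement in question before and after the change (the
fixed form is quoted again further down the thread):

    /* ExecChooseHashTableSize(), nodeHash.c */
    lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);   /* 9.5: int-typed shift   */
    lbuckets = 1L << my_log2(hash_table_bytes / bucket_size);  /* pushed fix: long shift */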

Thanks for finding and analyzing this and providing a patch!


--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Kevin Grittner <kgrittn@ymail.com> writes:
> Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
>> we may need a general overhaul around HashJoin to support large
>> data sizes, not only nbuckets around 0x80000000.

> Perhaps, but this is a clear bug, introduced to the 9.5 code, with
> an obvious fix; so I've pushed the change from 1 to 1L on that left
> shift.

I don't think it's anywhere near as clear as you think.  The preceding
lines should have limited nbuckets to be no more than INT_MAX/2, so how
could an overflow occur there?  (The result of 1 << my_log2() should be
at most 0x40000000 AFAICS.)  If overflow is possible, how will s/1/1L/
make it better on machines where int and long are the same size?

And on machines where long is wider than int, you've still got a bug
if the result of the left shift somehow overflowed an int, because
it's going to be assigned to nbuckets which is an int.

So I think the "1" coding was just fine in context.  If there is an
overflow in ExecChooseHashTableSize, it's somewhere else.
        regards, tom lane



Re: Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows

From
Kevin Grittner
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Kevin Grittner <kgrittn@ymail.com> writes:
>> Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
>>> we may need a general overhaul around HashJoin to support large
>>> data sizes, not only nbuckets around 0x80000000.

>> Perhaps, but this is a clear bug, introduced to the 9.5 code, with
>> an obvious fix; so I've pushed the change from 1 to 1L on that left
>> shift.

> I don't think it's anywhere near as clear as you think.  The preceding
> lines should have limited nbuckets to be no more than INT_MAX/2, so how
> could an overflow occur there?

There is a 1 shifted for nbuckets that I didn't touch because it
could not overflow.  The limits on nbuckets are not applied at the
point of the bug.

> (The result of 1<<mylog() should be
> at most 0x40000000 AFAICS.)  If overflow is possible, how will s/1/1L/
> make it better on machines where int and long are the same size?

In such cases there is not a bug -- at least not with my_log2()
returning less than 63.  In those cases the change will make no
difference.

> And on machines where long is wider than int, you've still got a bug
> if the result of the left shift somehow overflowed an int, because
> it's going to be assigned to nbuckets which is an int.

You seem to have missed reading the line in between:

| lbuckets = 1L << my_log2(hash_table_bytes / bucket_size);
| lbuckets = Min(lbuckets, max_pointers);
| nbuckets = (int) lbuckets;

There is a Min() and a cast to sort that out.

> So I think the "1" coding was just fine in context.

I disagree.

> If there is an overflow in ExecChooseHashTableSize, it's
> somewhere else.

I do believe that there are other places we need to place limits,
to avoid attempts to allocate more than 1GB, but I think patches to
address that should be separate from fixing this particular bug.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



I wrote:
> I don't think it's anywhere near as clear as you think.

Ah, scratch that --- I was looking at the wrong my_log2() call.
-ENOCAFFEINE.

I'm still doubtful that this is the only overflow risk in that new
ExecChooseHashTableSize code, though.  For instance, the only reason the
line immediately above this one isn't broken is that NTUP_PER_BUCKET is 1
... but surely we shouldn't be coding on that assumption.
        regards, tom lane



Re: Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows

From
Kevin Grittner
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I'm still doubtful that this is the only overflow risk in that
> new ExecChooseHashTableSize code, though.

KaiGai already pointed that out on this thread and I completely
agree; but I figured that I might as well fix the clear bug with an
obvious fix that was causing an assertion failure during 9.5
testing, pending analysis and fixes for the other hazards.  Getting
past the assertion failure might help identify some other bugs in
the area.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company