From: Tomas Vondra
Subject: Re: DBT-3 with SF=20 got failed
Msg-id: 557A19D1.9050107@2ndquadrant.com
In response to: Re: DBT-3 with SF=20 got failed (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: DBT-3 with SF=20 got failed (Simon Riggs <simon@2ndQuadrant.com>)
List: pgsql-hackers

Hi,

On 06/11/15 16:28, Robert Haas wrote:
> On Wed, Jun 10, 2015 at 10:57 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
>> The attached patch replaces this palloc0() by MemoryContextAllocHuge() + memset().
>> Indeed, this hash table is constructed towards the relation with nrows=119994544,
>> so, it is not strange even if hash-slot itself is larger than 1GB.
>
> You forgot to attach the patch, I think.  It looks to me like the size
> of a HashJoinTuple is going to be 16 bytes, so 1GB/16 = ~64 million.
> That's a lot of buckets, but maybe not unreasonably many if you've got
> enough memory.

Actually, HashJoinTuple is just a pointer, so it's 8 bytes, and 1GB is 
enough for 134217728 buckets (~134 million), which is more than the 
119994544 rows from the plan.
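
Just to spell out the arithmetic (a trivial standalone check, nothing 
PostgreSQL-specific, assuming a 64-bit build):

    #include <stdio.h>

    int main(void)
    {
        /* each bucket is a single HashJoinTuple pointer, 8 bytes on 64-bit */
        size_t one_gb = 1024UL * 1024 * 1024;
        printf("buckets per 1GB : %zu\n", one_gb / sizeof(void *)); /* 134217728 */
        printf("rows in the plan: %d\n", 119994544);
        return 0;
    }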

Also, looking at the error message again:
    ERROR:  invalid memory alloc request size 1073741824

but this time with beer goggles, I noticed that the amount reported is 
exactly 1GB. The backtrace also shows the error happens right inside 
ExecHashTableCreate (and not in the resize which may happen later), 
which means it gets the nbuckets from ExecChooseHashTableSize directly.

The resize is probably still broken as I mentioned before, but this 
crashes before reaching that code, as the estimates are high enough to 
trigger the issue during the initial sizing. ExecChooseHashTableSize is 
supposed to keep all the checks from previous versions, and I think it 
actually does.

But I don't see any check regarding the 1GB boundary there. What I see 
is this:
  max_pointers = (work_mem * 1024L) / sizeof(void *);
  max_pointers = Min(max_pointers, INT_MAX / 2);
  ...
  dbuckets = Min(dbuckets, max_pointers);

That has nothing to do with 1GB, and it's been in the code since the 
time work_mem was limited to 2GB, so perhaps the reasoning was that it's 
sufficient (because the tuples stored in the hash table will need more 
than 1/2 of the memory, or something like that).

But today this issue is more likely, because people have more RAM and 
use higher work_mem values, so max_pointers gets much higher. In the 
extreme it may reach INT_MAX/2, i.e. ~1 billion, so the bucket array 
alone would take ~8GB on 64-bit machines (on 32-bit machines we'd get 
twice the number of pointers compared to 64 bits, but that's mostly 
irrelevant because of the memory size limits).
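
To illustrate, here's a standalone sketch mirroring the max_pointers 
arithmetic above, with a hypothetical work_mem of 4GB - the INT_MAX/2 
clamp clearly does nothing to keep the bucket array under the 1GB 
allocation limit:

    #include <stdio.h>
    #include <limits.h>

    #define Min(x, y) ((x) < (y) ? (x) : (y))

    int main(void)
    {
        long long work_mem = 4LL * 1024 * 1024;   /* 4GB, in kB like the GUC */
        long long max_pointers = (work_mem * 1024) / sizeof(void *);

        max_pointers = Min(max_pointers, INT_MAX / 2);

        /* 536870912 pointers = 4096 MB of buckets, well past the 1GB limit */
        printf("max_pointers = %lld (%lld MB of buckets)\n",
               max_pointers,
               max_pointers * (long long) sizeof(void *) / (1024 * 1024));
        return 0;
    }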

It's also true that the hash-join improvements in 9.5 - namely the 
decrease of NTUP_PER_BUCKET from 10 to 1 - made this error more likely. 
With 9.4 we'd use only 16777216 buckets (128MB), because that gets us 
below 10 tuples per bucket. But now we're shooting for 1 tuple per 
bucket, so we end up with 134217728 buckets, and that's exactly 1GB.
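
Here's a rough sketch of that arithmetic, assuming the bucket count 
simply gets rounded up to the next power of two (which is what the 
nbuckets estimate does, AFAICS):

    #include <stdio.h>
    #include <math.h>

    /* round up to the next power of two */
    static long long next_pow2(double x)
    {
        return 1LL << (long long) ceil(log2(x));
    }

    int main(void)
    {
        double ntuples = 119994544.0;
        long long b94 = next_pow2(ntuples / 10);   /* NTUP_PER_BUCKET = 10 */
        long long b95 = next_pow2(ntuples / 1);    /* NTUP_PER_BUCKET = 1 */

        printf("9.4: %9lld buckets = %4lld MB\n", b94, b94 * 8 / (1024 * 1024));
        printf("9.5: %9lld buckets = %4lld MB\n", b95, b95 * 8 / (1024 * 1024));
        return 0;
    }

which prints 16777216 buckets (128MB) for 9.4 and 134217728 buckets 
(1024MB) for 9.5, matching the 1GB alloc request from the error.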

I see two ways to fix this:

(1) enforce the 1GB limit (probably better for back-patching, if that's necessary)

(2) make it work with hash tables over 1GB

I'm in favor of (2) if there's a good way to do that. It seems a bit 
stupid not to be able to use a fast hash table just because of an 
artificial allocation limit. Are there any fundamental reasons not to 
use the MemoryContextAllocHuge fix proposed by KaiGai-san?
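
For reference, option (1) would amount to a one-line clamp in 
ExecChooseHashTableSize, and option (2) is essentially what I understand 
the KaiGai-san patch to do in ExecHashTableCreate. Rough sketches only 
(field and constant names from memory, not the actual patch):

    /* option (1): clamp the bucket count so the array stays under the
     * regular palloc limit (MaxAllocSize, just below 1GB) */
    max_pointers = Min(max_pointers, MaxAllocSize / sizeof(HashJoinTuple));

    /* option (2): allocate the bucket array with the "huge" allocator,
     * which is not subject to the 1GB cap (assuming the array keeps
     * living in the batch context, as it does now) */
    hashtable->buckets = (HashJoinTuple *)
        MemoryContextAllocHuge(hashtable->batchCxt,
                               nbuckets * sizeof(HashJoinTuple));
    memset(hashtable->buckets, 0, nbuckets * sizeof(HashJoinTuple));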


--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


