Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash - Mailing list pgsql-bugs
From | Tomas Vondra |
---|---|
Subject | Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash |
Date | |
Msg-id | 20191108233044.awe5owizhowebyy5@development Whole thread Raw |
In response to | BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash (PG Bug reporting form <noreply@postgresql.org>) |
Responses |
Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
|
List | pgsql-bugs |
On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote: >The following bug has been logged on the website: > >Bug reference: 16104 >Logged by: James Coleman >Email address: jtc331@gmail.com >PostgreSQL version: 11.5 >Operating system: Debian >Description: > >We have a query that, after a recent logical migration to 11.5, ends up with >a parallel hash join (I don't think the query plan/query itself is important >here, but if needed after the rest of the explanation, I can try to redact >it for posting). The query results in this error: > >ERROR: invalid DSA memory alloc request size 1375731712 > >(the size changes sometimes significantly, but always over a GB) > >At first glance it sounded eerily similar to this report which preceded the >final release of 11.0: >https://www.postgresql.org/message-id/flat/CAEepm%3D1x48j0P5gwDUXyo6c9xRx0t_57UjVaz6X98fEyN-mQ4A%40mail.gmail.com#465f3a61bea2719bc4a7102541326dde >but I confirmed that the patch for that bug was applied and is in 11.5 (and >earlier). > >We managed to reproduce this on a replica, and so were able to attach gdb in >production to capture a backtrace: > >#0 errfinish (dummy=dummy@entry=0) at >./build/../src/backend/utils/error/elog.c:423 >#1 0x000055a7c0a00f79 in elog_finish (elevel=elevel@entry=20, >fmt=fmt@entry=0x55a7c0babc18 "invalid DSA memory alloc request size %zu") at >./build/../src/backend/utils/error/elog.c:1385 >#2 0x000055a7c0a2308b in dsa_allocate_extended (area=0x55a7c1d6aa38, >size=1140850688, flags=flags@entry=4) at >./build/../src/backend/utils/mmgr/dsa.c:677 >#3 0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches >(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at >./build/../src/backend/executor/nodeHash.c:2889 > ... I've briefly looked at this today, and I think the root cause is somewhat similar to what is described in [1] where we simply increase the number of batches in an effort to keep batch contents in work_mem, but ignoring that each batch requires quite a bit of memory. So we end up with a lot of batches where each is small enough to fit into work_mem, but we need much more than work_mem to track the batches. This seems to be about the same problem, except that instead of forgeting about BufFile, the parallel hash join ignores this: pstate->batches = dsa_allocate0(hashtable->area, EstimateParallelHashJoinBatch(hashtable) * nbatch); Looking at the backtrace, you ended up with nbatch=2097152 and it failed trying to allocate 1140850688 bytes, i.e. exactly 544B per batch. The EstimateParallelHashJoinBatch combines a bunch of structs and does depend on the number of participants, and by my rough calculation 544B means ~13 participants. Any idea if that matches the number of parallel workers for this query? Can you try running the query with fewer workers per gather? IMHO this has the same issue as the non-parallel case described in [1] in that we don't really consider this memory when planning the query, which means we may accept hash join for queries where we know the number of batches will be too high during planning. We need to reject parallel hash join in those cases. Of course, I don't know if this is such case - perhaps this is a case of underestimation and we only learn the actual number while executing the query, at which point it's too late to reject PHJ. I suppose the right solution in that case would be some form of BNL, as discussed in [2] but I'm not sure how far are we from having that, or if that works for PHJ. regards [1] https://www.postgresql.org/message-id/flat/20190504003414.bulcbnge3rhwhcsh@development [2] https://www.postgresql.org/message-id/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pgsql-bugs by date: