Thread: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
The following bug has been logged on the website: Bug reference: 16104 Logged by: James Coleman Email address: jtc331@gmail.com PostgreSQL version: 11.5 Operating system: Debian Description: We have a query that, after a recent logical migration to 11.5, ends up with a parallel hash join (I don't think the query plan/query itself is important here, but if needed after the rest of the explanation, I can try to redact it for posting). The query results in this error: ERROR: invalid DSA memory alloc request size 1375731712 (the size changes sometimes significantly, but always over a GB) At first glance it sounded eerily similar to this report which preceded the final release of 11.0: https://www.postgresql.org/message-id/flat/CAEepm%3D1x48j0P5gwDUXyo6c9xRx0t_57UjVaz6X98fEyN-mQ4A%40mail.gmail.com#465f3a61bea2719bc4a7102541326dde but I confirmed that the patch for that bug was applied and is in 11.5 (and earlier). We managed to reproduce this on a replica, and so were able to attach gdb in production to capture a backtrace: #0 errfinish (dummy=dummy@entry=0) at ./build/../src/backend/utils/error/elog.c:423 #1 0x000055a7c0a00f79 in elog_finish (elevel=elevel@entry=20, fmt=fmt@entry=0x55a7c0babc18 "invalid DSA memory alloc request size %zu") at ./build/../src/backend/utils/error/elog.c:1385 #2 0x000055a7c0a2308b in dsa_allocate_extended (area=0x55a7c1d6aa38, size=1140850688, flags=flags@entry=4) at ./build/../src/backend/utils/mmgr/dsa.c:677 #3 0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches (hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at ./build/../src/backend/executor/nodeHash.c:2889 #4 0x000055a7c079e5f9 in ExecParallelHashIncreaseNumBatches (hashtable=0x55a7c1db2740) at ./build/../src/backend/executor/nodeHash.c:1122 #5 0x000055a7c079ef6e in ExecParallelHashTuplePrealloc (size=56, batchno=<optimized out>, hashtable=0x55a7c1db2740) at ./build/../src/backend/executor/nodeHash.c:3283 #6 ExecParallelHashTableInsert (hashtable=hashtable@entry=0x55a7c1db2740, slot=slot@entry=0x55a7c1dadc90, hashvalue=<optimized out>) at ./build/../src/backend/executor/nodeHash.c:1716 #7 0x000055a7c079f17f in MultiExecParallelHash (node=0x55a7c1dacb78) at ./build/../src/backend/executor/nodeHash.c:288 #8 MultiExecHash (node=node@entry=0x55a7c1dacb78) at ./build/../src/backend/executor/nodeHash.c:112 #9 0x000055a7c078c40c in MultiExecProcNode (node=node@entry=0x55a7c1dacb78) at ./build/../src/backend/executor/execProcnode.c:501 #10 0x000055a7c07a07d5 in ExecHashJoinImpl (parallel=true, pstate=<optimized out>) at ./build/../src/backend/executor/nodeHashjoin.c:290 #11 ExecParallelHashJoin (pstate=<optimized out>) at ./build/../src/backend/executor/nodeHashjoin.c:581 #12 0x000055a7c078bdd9 in ExecProcNodeInstr (node=0x55a7c1d7b018) at ./build/../src/backend/executor/execProcnode.c:461 #13 0x000055a7c079f142 in ExecProcNode (node=0x55a7c1d7b018) at ./build/../src/include/executor/executor.h:251 #14 MultiExecParallelHash (node=0x55a7c1d759d0) at ./build/../src/backend/executor/nodeHash.c:281 #15 MultiExecHash (node=node@entry=0x55a7c1d759d0) at ./build/../src/backend/executor/nodeHash.c:112 #16 0x000055a7c078c40c in MultiExecProcNode (node=node@entry=0x55a7c1d759d0) at ./build/../src/backend/executor/execProcnode.c:501 #17 0x000055a7c07a07d5 in ExecHashJoinImpl (parallel=true, pstate=<optimized out>) at ./build/../src/backend/executor/nodeHashjoin.c:290 #18 ExecParallelHashJoin (pstate=<optimized out>) at ./build/../src/backend/executor/nodeHashjoin.c:581 #19 0x000055a7c078bdd9 in 
ExecProcNodeInstr (node=0x55a7c1d74e60) at ./build/../src/backend/executor/execProcnode.c:461 #20 0x000055a7c0784303 in ExecProcNode (node=0x55a7c1d74e60) at ./build/../src/include/executor/executor.h:251 #21 ExecutePlan (execute_once=<optimized out>, dest=0x55a7c1d0be00, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x55a7c1d74e60, estate=0x55a7c1d74b70) at ./build/../src/backend/executor/execMain.c:1640 #22 standard_ExecutorRun (queryDesc=0x55a7c1d5dcd0, direction=<optimized out>, count=0, execute_once=<optimized out>) at ./build/../src/backend/executor/execMain.c:369 #23 0x00007f4b8b9ace85 in pgss_ExecutorRun (queryDesc=0x55a7c1d5dcd0, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at ./build/../contrib/pg_stat_statements/pg_stat_statements.c:892 #24 0x000055a7c07893d1 in ParallelQueryMain (seg=0x55a7c1caa648, toc=<optimized out>) at ./build/../src/backend/executor/execParallel.c:1401 #25 0x000055a7c064ee64 in ParallelWorkerMain (main_arg=<optimized out>) at ./build/../src/backend/access/transam/parallel.c:1409 #26 0x000055a7c08568ed in StartBackgroundWorker () at ./build/../src/backend/postmaster/bgworker.c:834 #27 0x000055a7c08637b5 in do_start_bgworker (rw=0x55a7c1c98200) at ./build/../src/backend/postmaster/postmaster.c:5722 #28 maybe_start_bgworkers () at ./build/../src/backend/postmaster/postmaster.c:5935 #29 0x000055a7c0864355 in sigusr1_handler (postgres_signal_arg=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:5096 #30 <signal handler called> #31 0x00007f4b915895e3 in select () from /lib/x86_64-linux-gnu/libc.so.6 #32 0x000055a7c05d8b5d in ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1671 #33 0x000055a7c08654f1 in PostmasterMain (argc=5, argv=0x55a7c1c73e50) at ./build/../src/backend/postmaster/postmaster.c:1380 #34 0x000055a7c05dac34 in main (argc=5, argv=0x55a7c1c73e50) at ./build/../src/backend/main/main.c:228 From what I can tell it seems that src/backend/executor/nodeHash.c:2888 (looking at the 11.5 release tag) is another entry point into similar potential problems as were guarded against in the patch I mentioned earlier, and that this is another way parallel hash nodes can end up attempting to allocate more memory than is allowed. Thanks, James Coleman
On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote:
>The following bug has been logged on the website:
>
>Bug reference: 16104
>Logged by: James Coleman
>Email address: jtc331@gmail.com
>PostgreSQL version: 11.5
>Operating system: Debian
>Description:
>
>We have a query that, after a recent logical migration to 11.5, ends up with
>a parallel hash join (I don't think the query plan/query itself is important
>here, but if needed after the rest of the explanation, I can try to redact
>it for posting). The query results in this error:
>
>ERROR: invalid DSA memory alloc request size 1375731712
>
>(the size changes sometimes significantly, but always over a GB)
>
>At first glance it sounded eerily similar to this report which preceded the
>final release of 11.0:
>https://www.postgresql.org/message-id/flat/CAEepm%3D1x48j0P5gwDUXyo6c9xRx0t_57UjVaz6X98fEyN-mQ4A%40mail.gmail.com#465f3a61bea2719bc4a7102541326dde
>but I confirmed that the patch for that bug was applied and is in 11.5 (and
>earlier).
>
>We managed to reproduce this on a replica, and so were able to attach gdb in
>production to capture a backtrace:
>
>#0 errfinish (dummy=dummy@entry=0) at
>./build/../src/backend/utils/error/elog.c:423
>#1 0x000055a7c0a00f79 in elog_finish (elevel=elevel@entry=20,
>fmt=fmt@entry=0x55a7c0babc18 "invalid DSA memory alloc request size %zu") at
>./build/../src/backend/utils/error/elog.c:1385
>#2 0x000055a7c0a2308b in dsa_allocate_extended (area=0x55a7c1d6aa38,
>size=1140850688, flags=flags@entry=4) at
>./build/../src/backend/utils/mmgr/dsa.c:677
>#3 0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches
>(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at
>./build/../src/backend/executor/nodeHash.c:2889
> ...

I've briefly looked at this today, and I think the root cause is
somewhat similar to what is described in [1], where we simply increase
the number of batches in an effort to keep batch contents in work_mem,
but ignore that each batch itself requires quite a bit of memory. So we
end up with a lot of batches where each is small enough to fit into
work_mem, but we need much more than work_mem to track the batches.

This seems to be about the same problem, except that instead of
forgetting about BufFile, the parallel hash join ignores this:

    pstate->batches =
        dsa_allocate0(hashtable->area,
                      EstimateParallelHashJoinBatch(hashtable) * nbatch);

Looking at the backtrace, you ended up with nbatch=2097152 and it failed
trying to allocate 1140850688 bytes, i.e. exactly 544B per batch.

The EstimateParallelHashJoinBatch combines a bunch of structs and does
depend on the number of participants, and by my rough calculation 544B
means ~13 participants. Any idea if that matches the number of parallel
workers for this query? Can you try running the query with fewer workers
per gather?

IMHO this has the same issue as the non-parallel case described in [1]
in that we don't really consider this memory when planning the query,
which means we may accept hash join for queries where we already know
during planning that the number of batches will be too high. We need to
reject parallel hash join in those cases.

Of course, I don't know if this is such a case - perhaps this is a case
of underestimation and we only learn the actual number while executing
the query, at which point it's too late to reject PHJ. I suppose the
right solution in that case would be some form of BNL, as discussed in
[2], but I'm not sure how far we are from having that, or if that works
for PHJ.
regards [1] https://www.postgresql.org/message-id/flat/20190504003414.bulcbnge3rhwhcsh@development [2] https://www.postgresql.org/message-id/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
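To make the arithmetic above concrete, here is a small standalone C
sketch (not PostgreSQL code; the ~1GB cap mirrors the MaxAllocSize limit
on ordinary palloc/dsa_allocate requests, and the other values are taken
straight from the backtrace):

    /* Back-of-the-envelope check of the numbers in the backtrace above.
     * Standalone sketch; only the ~1GB cap and the two quoted values are
     * assumed. */
    #include <stdio.h>
    #include <stdint.h>

    int
    main(void)
    {
        uint64_t request   = 1140850688;    /* size passed to dsa_allocate_extended */
        uint64_t nbatch    = 2097152;       /* nbatch in frame #3 */
        uint64_t max_alloc = 0x3fffffff;    /* ~1GB cap on ordinary allocations */

        printf("bytes per batch: %llu\n",
               (unsigned long long) (request / nbatch));    /* 544 */
        printf("over the cap by: %llu bytes\n",
               (unsigned long long) (request - max_alloc)); /* ~64MB */
        return 0;
    }

So the failing request is only about 6% past the 1GB ceiling, and the
544 bytes per batch is the figure the rest of the thread keeps coming
back to.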
On Fri, Nov 8, 2019 at 6:30 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote: > >The following bug has been logged on the website: > > > >Bug reference: 16104 > >Logged by: James Coleman > >Email address: jtc331@gmail.com > >PostgreSQL version: 11.5 > >Operating system: Debian > >Description: > > > >We have a query that, after a recent logical migration to 11.5, ends up with > >a parallel hash join (I don't think the query plan/query itself is important > >here, but if needed after the rest of the explanation, I can try to redact > >it for posting). The query results in this error: > > > >ERROR: invalid DSA memory alloc request size 1375731712 > > > >(the size changes sometimes significantly, but always over a GB) > > > >At first glance it sounded eerily similar to this report which preceded the > >final release of 11.0: > >https://www.postgresql.org/message-id/flat/CAEepm%3D1x48j0P5gwDUXyo6c9xRx0t_57UjVaz6X98fEyN-mQ4A%40mail.gmail.com#465f3a61bea2719bc4a7102541326dde > >but I confirmed that the patch for that bug was applied and is in 11.5 (and > >earlier). > > > >We managed to reproduce this on a replica, and so were able to attach gdb in > >production to capture a backtrace: > > > >#0 errfinish (dummy=dummy@entry=0) at > >./build/../src/backend/utils/error/elog.c:423 > >#1 0x000055a7c0a00f79 in elog_finish (elevel=elevel@entry=20, > >fmt=fmt@entry=0x55a7c0babc18 "invalid DSA memory alloc request size %zu") at > >./build/../src/backend/utils/error/elog.c:1385 > >#2 0x000055a7c0a2308b in dsa_allocate_extended (area=0x55a7c1d6aa38, > >size=1140850688, flags=flags@entry=4) at > >./build/../src/backend/utils/mmgr/dsa.c:677 > >#3 0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches > >(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at > >./build/../src/backend/executor/nodeHash.c:2889 > > ... > > I've briefly looked at this today, and I think the root cause is > somewhat similar to what is described in [1] where we simply increase > the number of batches in an effort to keep batch contents in work_mem, > but ignoring that each batch requires quite a bit of memory. So we end > up with a lot of batches where each is small enough to fit into > work_mem, but we need much more than work_mem to track the batches. > > This seems to be about the same problem, except that instead of > forgeting about BufFile, the parallel hash join ignores this: > > pstate->batches = > dsa_allocate0(hashtable->area, > EstimateParallelHashJoinBatch(hashtable) * nbatch); > > Looking at the backtrace, you ended up with nbatch=2097152 and it failed > trying to allocate 1140850688 bytes, i.e. exactly 544B per batch. > > The EstimateParallelHashJoinBatch combines a bunch of structs and does > depend on the number of participants, and by my rough calculation 544B > means ~13 participants. Any idea if that matches the number of parallel > workers for this query? We currently have max_parallel_workers_per_gather = 6, which is where the repro happened with the backtrace. We initially had the problem with max_parallel_workers_per_gather = 2. I also just confirmed an explain shows Workers Planned: 6. > Can you try running the query with fewer workers per gather? Since it failed initially with 2, I think we've effectively tested this already (I assume that's a significant enough change to be expected to solve it). 
> IMHO this has the same issue as the non-parallel case described in [1] > in that we don't really consider this memory when planning the query, > which means we may accept hash join for queries where we know the number > of batches will be too high during planning. We need to reject parallel > hash join in those cases. > > Of course, I don't know if this is such case - perhaps this is a case of > underestimation and we only learn the actual number while executing the > query, at which point it's too late to reject PHJ. I suppose the right > solution in that case would be some form of BNL, as discussed in [2] but > I'm not sure how far are we from having that, or if that works for PHJ. > > > regards > > [1] https://www.postgresql.org/message-id/flat/20190504003414.bulcbnge3rhwhcsh@development > > [2] https://www.postgresql.org/message-id/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com > > -- > Tomas Vondra http://www.2ndQuadrant.com > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Nov 9, 2019 at 1:23 PM James Coleman <jtc331@gmail.com> wrote:
> On Fri, Nov 8, 2019 at 6:30 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> > On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote:
> > >ERROR: invalid DSA memory alloc request size 1375731712
> > >#3 0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches
> > >(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at
> > I've briefly looked at this today, and I think the root cause is
> > somewhat similar to what is described in [1] where we simply increase
> > the number of batches in an effort to keep batch contents in work_mem,
> > but ignoring that each batch requires quite a bit of memory. So we end
> > up with a lot of batches where each is small enough to fit into
> > work_mem, but we need much more than work_mem to track the batches.

Yeah. So even when this is fixed, the query is going to perform
*terribly*, opening and closing millions of files in random order to
stream tuples into, if this is a case where there really are tuples to
go to all partitions (and not just a case of extreme skew that our
extreme skew detector fails to detect because it only detects absolute
extreme skew).

> > This seems to be about the same problem, except that instead of
> > forgetting about BufFile, the parallel hash join ignores this:
> >
> >     pstate->batches =
> >         dsa_allocate0(hashtable->area,
> >                       EstimateParallelHashJoinBatch(hashtable) * nbatch);

Yeah, I failed to consider that possibility. I suppose it could be
avoided with something like this (not tested, I will find a repro for
this on Monday to convince myself that it's right):

@@ -1246,7 +1246,10 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
                        }

                        /* Don't keep growing if it's not helping or we'd overflow. */
-                       if (extreme_skew_detected || hashtable->nbatch >= INT_MAX / 2)
+                       if (extreme_skew_detected ||
+                           hashtable->nbatch >= INT_MAX / 2 ||
+                           !AllocSizeIsValid(EstimateParallelHashJoinBatch(hashtable) *
+                                             hashtable->nbatch * 2))
                                pstate->growth = PHJ_GROWTH_DISABLED;
                        else if (space_exhausted)
                                pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;

But James's query is still going to be terrible.

Do you know if it's extreme skew (many tuples with the same key, just
a few scattered around in other keys), or simply too much data for
your work_mem setting?
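To get a feel for where that extra guard would kick in on this
particular workload, here is a small standalone sketch (assuming the
~544 bytes of shared batch metadata per batch observed above; the cap
mirrors MaxAllocSize, and none of this is PostgreSQL code):

    /* Sketch of where the proposed AllocSizeIsValid() guard would stop
     * batch growth, assuming ~544 bytes of shared metadata per batch as
     * observed in this report. */
    #include <stdio.h>
    #include <stdint.h>

    #define MAX_ALLOC ((uint64_t) 0x3fffffff)   /* mirrors MaxAllocSize */

    int
    main(void)
    {
        uint64_t per_batch = 544;   /* observed bytes per batch */
        uint64_t nbatch;

        for (nbatch = 4096; nbatch <= 2097152; nbatch *= 2)
            printf("nbatch=%7llu: next doubling %s\n",
                   (unsigned long long) nbatch,
                   per_batch * nbatch * 2 <= MAX_ALLOC ? "allowed"
                                                       : "disabled");
        return 0;
    }

With these numbers the guard would flip growth to PHJ_GROWTH_DISABLED at
1,048,576 batches - exactly one doubling before the allocation that
failed here - rather than erroring out.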
On Fri, Nov 8, 2019 at 8:12 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Sat, Nov 9, 2019 at 1:23 PM James Coleman <jtc331@gmail.com> wrote: > > On Fri, Nov 8, 2019 at 6:30 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > > On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote: > > > >ERROR: invalid DSA memory alloc request size 1375731712 > > > > >#3 0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches > > > >(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at > > > > I've briefly looked at this today, and I think the root cause is > > > somewhat similar to what is described in [1] where we simply increase > > > the number of batches in an effort to keep batch contents in work_mem, > > > but ignoring that each batch requires quite a bit of memory. So we end > > > up with a lot of batches where each is small enough to fit into > > > work_mem, but we need much more than work_mem to track the batches. > > Yeah. So even when this is fixed, the query is going to perform > *terribly*, opening and closing millions of files in random order to > stream tuples into, if this is case where there really are tuples to > go to all partitions (and not just a case of extreme skew that our > extreme skew detector fails to detect because it only detects absolute > extreme skew). work_mem in the repro case is 500MB (the original failure was at 150MB). I realize that's too small for this query, though it's also worth knowing that if I get rid of some other cluster-wide tunings that shouldn't have been cluster-wide original (modifications to cpu_*_cost), the seq scan on a TB+ table feeding the hash turns into an index scan and no hash (and performs much better). I think this also correlates with us seeing ~TB spike in disk usage, so your explanation of the lots of "small" files would seem to be consistent with that. > > > This seems to be about the same problem, except that instead of > > > forgeting about BufFile, the parallel hash join ignores this: > > > > > > pstate->batches = > > > dsa_allocate0(hashtable->area, > > > EstimateParallelHashJoinBatch(hashtable) * nbatch); > > Yeah, I failed to consider that possibility. I suppose it could be > avoided with something like this (not tested, I will find a repro for > this on Monday to convince myself that it's right): > > @@ -1246,7 +1246,10 @@ > ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable) > } > > /* Don't keep growing if it's not > helping or we'd overflow. */ > - if (extreme_skew_detected || > hashtable->nbatch >= INT_MAX / 2) > + if (extreme_skew_detected || > + hashtable->nbatch >= INT_MAX / 2 || > + > !AllocSizeIsValid(EstimateParallelHashJoinBatch(hashtable) * > + > hashtable->nbatch * 2)) > pstate->growth = PHJ_GROWTH_DISABLED; > else if (space_exhausted) > pstate->growth = > PHJ_GROWTH_NEED_MORE_BATCHES; > > But James's query is still going to be terrible. > > Do you know if it's extreme skew (many tuples with the same key, just > a few scattered around in other keys), or simply too much data for > your work_mem setting? Given my description earlier (seq scan on a very large table), I assume it's likely the latter? If you think that's sufficiently likely, I'll leave it at that, or if not I could do calculation on that key to see how distributed it is. James
On Fri, Nov 08, 2019 at 09:10:13PM -0500, James Coleman wrote: >On Fri, Nov 8, 2019 at 8:12 PM Thomas Munro <thomas.munro@gmail.com> wrote: >> >> On Sat, Nov 9, 2019 at 1:23 PM James Coleman <jtc331@gmail.com> wrote: >> > On Fri, Nov 8, 2019 at 6:30 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> > > On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote: >> > > >ERROR: invalid DSA memory alloc request size 1375731712 >> >> > > >#3 0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches >> > > >(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at >> >> > > I've briefly looked at this today, and I think the root cause is >> > > somewhat similar to what is described in [1] where we simply increase >> > > the number of batches in an effort to keep batch contents in work_mem, >> > > but ignoring that each batch requires quite a bit of memory. So we end >> > > up with a lot of batches where each is small enough to fit into >> > > work_mem, but we need much more than work_mem to track the batches. >> >> Yeah. So even when this is fixed, the query is going to perform >> *terribly*, opening and closing millions of files in random order to >> stream tuples into, if this is case where there really are tuples to >> go to all partitions (and not just a case of extreme skew that our >> extreme skew detector fails to detect because it only detects absolute >> extreme skew). > >work_mem in the repro case is 500MB (the original failure was at >150MB). I realize that's too small for this query, though it's also >worth knowing that if I get rid of some other cluster-wide tunings >that shouldn't have been cluster-wide original (modifications to >cpu_*_cost), the seq scan on a TB+ table feeding the hash turns into >an index scan and no hash (and performs much better). > So is it a case of underestimate? I.e. does the Hash side expect much less data (rows) than it gets during execution? >I think this also correlates with us seeing ~TB spike in disk usage, >so your explanation of the lots of "small" files would seem to be >consistent with that. That's consistent with the data. 500MB and nbatch=2097152 is exactly 1TB, and there'll be some additional overhead. > >> > > This seems to be about the same problem, except that instead of >> > > forgeting about BufFile, the parallel hash join ignores this: >> > > >> > > pstate->batches = >> > > dsa_allocate0(hashtable->area, >> > > EstimateParallelHashJoinBatch(hashtable) * nbatch); >> >> Yeah, I failed to consider that possibility. I suppose it could be >> avoided with something like this (not tested, I will find a repro for >> this on Monday to convince myself that it's right): >> >> @@ -1246,7 +1246,10 @@ >> ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable) >> } >> >> /* Don't keep growing if it's not >> helping or we'd overflow. */ >> - if (extreme_skew_detected || >> hashtable->nbatch >= INT_MAX / 2) >> + if (extreme_skew_detected || >> + hashtable->nbatch >= INT_MAX / 2 || >> + >> !AllocSizeIsValid(EstimateParallelHashJoinBatch(hashtable) * >> + >> hashtable->nbatch * 2)) >> pstate->growth = PHJ_GROWTH_DISABLED; >> else if (space_exhausted) >> pstate->growth = >> PHJ_GROWTH_NEED_MORE_BATCHES; >> Probably, but that's the execution part of it. I think we should also consider this in ExecChooseHashTableSize, and just don't do PHJ when it exceeds work_mem from the very beginning. Once we start executing we probably can't do much better than what you proposed. 
In particular it doesn't make sense to limit the space by work_mem,
unless we also tweak that limit, because then the batch size increases
arbitrarily.

I think we need to do something about this in PG13 - both while planning
(considering BufFile and SharedTuplestore), and during execution. The
planner part seems fairly simple and independent, and I might have said
before that I'll look into it.

For the executor I think we've agreed the "proper" solution is BNL or
something like that. Not sure how far we are from that, though; I don't
recall any recent updates (but maybe I just missed them, the
pgsql-hackers traffic is pretty insane). I wonder if we should get
something like the "rebalancing" I proposed before, which is not a 100%
fix but may at least reduce the negative impact.

>> But James's query is still going to be terrible.
>>
>> Do you know if it's extreme skew (many tuples with the same key, just
>> a few scattered around in other keys), or simply too much data for
>> your work_mem setting?
>
>Given my description earlier (seq scan on a very large table), I
>assume it's likely the latter? If you think that's sufficiently
>likely, I'll leave it at that, or if not I could do calculation on
>that key to see how distributed it is.
>

Depends on where exactly the seq scan is - is it under the Hash? If yes,
how come we even pick hash join in this case? Hard to say without seeing
the query plan.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
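As a rough illustration of the planner-side idea: the sketch below
refuses (or would cap) a parallel hash join when the per-batch
bookkeeping alone rivals work_mem. The constants and the function name
are assumptions picked for the example - the real check would live in
ExecChooseHashTableSize and use the actual struct sizes.

    /* Illustrative sketch only: reject a parallel hash join when the
     * per-batch bookkeeping alone would rival work_mem.  Constants and
     * names are assumptions for the example, not PostgreSQL's. */
    #include <stdio.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define META_PER_BATCH(workers) (192 + 64 * (uint64_t) (workers)) /* assumed */
    #define BUFFILE_PER_BATCH       8192                              /* assumed */

    static bool
    batch_overhead_ok(uint64_t nbatch, int workers, uint64_t work_mem_bytes)
    {
        return nbatch * (META_PER_BATCH(workers) + BUFFILE_PER_BATCH)
               <= work_mem_bytes;
    }

    int
    main(void)
    {
        uint64_t work_mem = 500 * 1024 * 1024;  /* 500MB, as in the repro */

        /* Roughly what the planner expected: a few hundred batches. */
        printf("nbatch=512:     %s\n",
               batch_overhead_ok(512, 6, work_mem) ? "ok" : "reject/cap");

        /* What execution actually grew to before failing. */
        printf("nbatch=2097152: %s\n",
               batch_overhead_ok(2097152, 6, work_mem) ? "ok" : "reject/cap");
        return 0;
    }

Which also illustrates the caveat above: with the planner's own estimate
the check passes, so a planning-time guard only helps when the batch
explosion is predictable up front; the runaway case in this report is
only caught during execution.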
Just realized I accidentally replied only to Tomas. On Sat, Nov 9, 2019 at 4:55 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Fri, Nov 08, 2019 at 09:10:13PM -0500, James Coleman wrote: > >On Fri, Nov 8, 2019 at 8:12 PM Thomas Munro <thomas.munro@gmail.com> wrote: > >> > >> On Sat, Nov 9, 2019 at 1:23 PM James Coleman <jtc331@gmail.com> wrote: > >> > On Fri, Nov 8, 2019 at 6:30 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > >> > > On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote: > >> > > >ERROR: invalid DSA memory alloc request size 1375731712 > >> > >> > > >#3 0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches > >> > > >(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at > >> > >> > > I've briefly looked at this today, and I think the root cause is > >> > > somewhat similar to what is described in [1] where we simply increase > >> > > the number of batches in an effort to keep batch contents in work_mem, > >> > > but ignoring that each batch requires quite a bit of memory. So we end > >> > > up with a lot of batches where each is small enough to fit into > >> > > work_mem, but we need much more than work_mem to track the batches. > >> > >> Yeah. So even when this is fixed, the query is going to perform > >> *terribly*, opening and closing millions of files in random order to > >> stream tuples into, if this is case where there really are tuples to > >> go to all partitions (and not just a case of extreme skew that our > >> extreme skew detector fails to detect because it only detects absolute > >> extreme skew). > > > >work_mem in the repro case is 500MB (the original failure was at > >150MB). I realize that's too small for this query, though it's also > >worth knowing that if I get rid of some other cluster-wide tunings > >that shouldn't have been cluster-wide original (modifications to > >cpu_*_cost), the seq scan on a TB+ table feeding the hash turns into > >an index scan and no hash (and performs much better). > > > > So is it a case of underestimate? I.e. does the Hash side expect much > less data (rows) than it gets during execution? I've attached a redacted plan, but the estimate with workers planned = 6 is 986,975,281, which seems quite reasonable to be as the current table count is 5,917,539,491. > >I think this also correlates with us seeing ~TB spike in disk usage, > >so your explanation of the lots of "small" files would seem to be > >consistent with that. > > That's consistent with the data. 500MB and nbatch=2097152 is exactly > 1TB, and there'll be some additional overhead. Good, that helps make sense of that piece of the puzzle. > >> > > This seems to be about the same problem, except that instead of > >> > > forgeting about BufFile, the parallel hash join ignores this: > >> > > > >> > > pstate->batches = > >> > > dsa_allocate0(hashtable->area, > >> > > EstimateParallelHashJoinBatch(hashtable) * nbatch); > >> > >> Yeah, I failed to consider that possibility. I suppose it could be > >> avoided with something like this (not tested, I will find a repro for > >> this on Monday to convince myself that it's right): > >> > >> @@ -1246,7 +1246,10 @@ > >> ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable) > >> } > >> > >> /* Don't keep growing if it's not > >> helping or we'd overflow. 
*/ > >> - if (extreme_skew_detected || > >> hashtable->nbatch >= INT_MAX / 2) > >> + if (extreme_skew_detected || > >> + hashtable->nbatch >= INT_MAX / 2 || > >> + > >> !AllocSizeIsValid(EstimateParallelHashJoinBatch(hashtable) * > >> + > >> hashtable->nbatch * 2)) > >> pstate->growth = PHJ_GROWTH_DISABLED; > >> else if (space_exhausted) > >> pstate->growth = > >> PHJ_GROWTH_NEED_MORE_BATCHES; > >> > > Probably, but that's the execution part of it. I think we should also > consider this in ExecChooseHashTableSize, and just don't do PHJ when > it exceeds work_mem from the very beginning. > > Once we start executing we probably can't do much better than what you > proposed. In particular it doesn't make sense to limit the space by > work_mem, unless we also tweak that limit because then the batch size > increases arbitrarily. > > I think we need do something about this in PG13 - both while planning > (considering BufFile and SharedTuplestore), and during execution. The > planner part seems fairly simple and independent, and I might have said > before I'll look into it. > > For the executor I think we've agreed the "proper" solution is BNL or > something like that. Not sure how far are we from that, though, I > don't recall any recent updates (but maybe I just missed that, the > pgsql-hackers traffic is pretty insane). I wonder if we should get > something like the "rebalancing" I proposed before, which is not a 100% > fix but may at least reduce the negative impact. Do you happen have a link to those discussions? I'd be interested in following along. I also can't say I know what BNL stands for. > >> But James's query is still going to be terrible. > >> > >> Do you know if it's extreme skew (many tuples with the same key, just > >> a few scattered around in other keys), or simply too much data for > >> your work_mem setting? > > > >Given my description earlier (seq scan on a very large table), I > >assume it's likely the latter? If you think that's sufficiently > >likely, I'll leave it at that, or if not I could do calculation on > >that key to see how distributed it is. > > > > Depends where exactly is the seq scan - is it under the Hash? If yes, > how come we even pick hash join in this case? Hard to say without seeing > the query plan. As noted earlier, I've attached a redacted plan here. The seq scan is under the parallel hash. Note that also as hinted at earlier, this happens with cpu_tuple_cost and cpu_operator_cost both set to 0.5, which has been a long-standing tweak on this cluster to affect some plans that are otherwise painfully bad, but really should be targeted at specific queries. With those reset, everything stays the same except that the hash join turns into a nested loop, and instead of a hash on the inner side of the join there's an index scan (the join key is indexed on that table). Because of the modified cpu_*_cost gucs we have two ways around this specific query plan: either reset those gucs or set enable_parallel_hash = off. But even if it means poor performance, it seems to me that we wouldn't want to fail the query. I can confirm that it is in fact capable of completing with this plan even with work_mem = 150MB. It's slower...but "only" by 7-8x. That timing would have been some level of unsurprising in this batch system, and also much closer to "normal" for the non-parallel plan we used to have in 9.6. 
We were able to see this result on replicas while trying to find out exactly how to reproduce the error (it seems sometimes it was right under the boundary needed to raise the error). James
On Sat, Nov 09, 2019 at 08:23:23AM -0500, James Coleman wrote: >Just realized I accidentally replied only to Tomas. > >On Sat, Nov 9, 2019 at 4:55 AM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> >> On Fri, Nov 08, 2019 at 09:10:13PM -0500, James Coleman wrote: >> >On Fri, Nov 8, 2019 at 8:12 PM Thomas Munro <thomas.munro@gmail.com> wrote: >> >> >> >> On Sat, Nov 9, 2019 at 1:23 PM James Coleman <jtc331@gmail.com> wrote: >> >> > On Fri, Nov 8, 2019 at 6:30 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> >> > > On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote: >> >> > > >ERROR: invalid DSA memory alloc request size 1375731712 >> >> >> >> > > >#3 0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches >> >> > > >(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at >> >> >> >> > > I've briefly looked at this today, and I think the root cause is >> >> > > somewhat similar to what is described in [1] where we simply increase >> >> > > the number of batches in an effort to keep batch contents in work_mem, >> >> > > but ignoring that each batch requires quite a bit of memory. So we end >> >> > > up with a lot of batches where each is small enough to fit into >> >> > > work_mem, but we need much more than work_mem to track the batches. >> >> >> >> Yeah. So even when this is fixed, the query is going to perform >> >> *terribly*, opening and closing millions of files in random order to >> >> stream tuples into, if this is case where there really are tuples to >> >> go to all partitions (and not just a case of extreme skew that our >> >> extreme skew detector fails to detect because it only detects absolute >> >> extreme skew). >> > >> >work_mem in the repro case is 500MB (the original failure was at >> >150MB). I realize that's too small for this query, though it's also >> >worth knowing that if I get rid of some other cluster-wide tunings >> >that shouldn't have been cluster-wide original (modifications to >> >cpu_*_cost), the seq scan on a TB+ table feeding the hash turns into >> >an index scan and no hash (and performs much better). >> > >> >> So is it a case of underestimate? I.e. does the Hash side expect much >> less data (rows) than it gets during execution? > >I've attached a redacted plan, but the estimate with workers planned = >6 is 986,975,281, which seems quite reasonable to be as the current >table count is 5,917,539,491. > Hmmm, but the expected row width is only 16B, and with 6M rows that's only about 90GB. So how come this needs 1TB temporary files? I'm sure there's a bit of overhead, but 10X seems a bit much. >> >I think this also correlates with us seeing ~TB spike in disk usage, >> >so your explanation of the lots of "small" files would seem to be >> >consistent with that. >> >> That's consistent with the data. 500MB and nbatch=2097152 is exactly >> 1TB, and there'll be some additional overhead. > >Good, that helps make sense of that piece of the puzzle. > >> >> > > This seems to be about the same problem, except that instead of >> >> > > forgeting about BufFile, the parallel hash join ignores this: >> >> > > >> >> > > pstate->batches = >> >> > > dsa_allocate0(hashtable->area, >> >> > > EstimateParallelHashJoinBatch(hashtable) * nbatch); >> >> >> >> Yeah, I failed to consider that possibility. 
I suppose it could be >> >> avoided with something like this (not tested, I will find a repro for >> >> this on Monday to convince myself that it's right): >> >> >> >> @@ -1246,7 +1246,10 @@ >> >> ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable) >> >> } >> >> >> >> /* Don't keep growing if it's not >> >> helping or we'd overflow. */ >> >> - if (extreme_skew_detected || >> >> hashtable->nbatch >= INT_MAX / 2) >> >> + if (extreme_skew_detected || >> >> + hashtable->nbatch >= INT_MAX / 2 || >> >> + >> >> !AllocSizeIsValid(EstimateParallelHashJoinBatch(hashtable) * >> >> + >> >> hashtable->nbatch * 2)) >> >> pstate->growth = PHJ_GROWTH_DISABLED; >> >> else if (space_exhausted) >> >> pstate->growth = >> >> PHJ_GROWTH_NEED_MORE_BATCHES; >> >> >> >> Probably, but that's the execution part of it. I think we should also >> consider this in ExecChooseHashTableSize, and just don't do PHJ when >> it exceeds work_mem from the very beginning. >> >> Once we start executing we probably can't do much better than what you >> proposed. In particular it doesn't make sense to limit the space by >> work_mem, unless we also tweak that limit because then the batch size >> increases arbitrarily. >> >> I think we need do something about this in PG13 - both while planning >> (considering BufFile and SharedTuplestore), and during execution. The >> planner part seems fairly simple and independent, and I might have said >> before I'll look into it. >> >> For the executor I think we've agreed the "proper" solution is BNL or >> something like that. Not sure how far are we from that, though, I >> don't recall any recent updates (but maybe I just missed that, the >> pgsql-hackers traffic is pretty insane). I wonder if we should get >> something like the "rebalancing" I proposed before, which is not a 100% >> fix but may at least reduce the negative impact. > >Do you happen have a link to those discussions? I'd be interested in >following along. I also can't say I know what BNL stands for. > I think the two main threads are these two: 1) accounting for memory used for BufFile during hash joins https://www.postgresql.org/message-id/20190504003414.bulcbnge3rhwhcsh%40development 2) Avoiding hash join batch explosions with extreme skew and weird stats https://www.postgresql.org/message-id/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com The BNL means "block nested loop" which means we split the inner relation into blocks, and then do nested loop for (some of) them. This works as a fallback strategy for hash join - when a batch is too large to fit into work_mem, we can split it into blocks and do a BNL against the outer relation. >> >> But James's query is still going to be terrible. >> >> >> >> Do you know if it's extreme skew (many tuples with the same key, just >> >> a few scattered around in other keys), or simply too much data for >> >> your work_mem setting? >> > >> >Given my description earlier (seq scan on a very large table), I >> >assume it's likely the latter? If you think that's sufficiently >> >likely, I'll leave it at that, or if not I could do calculation on >> >that key to see how distributed it is. >> > >> >> Depends where exactly is the seq scan - is it under the Hash? If yes, >> how come we even pick hash join in this case? Hard to say without seeing >> the query plan. > >As noted earlier, I've attached a redacted plan here. > >The seq scan is under the parallel hash. 
>Note that also as hinted at earlier, this happens with cpu_tuple_cost
>and cpu_operator_cost both set to 0.5, which has been a long-standing
>tweak on this cluster to affect some plans that are otherwise painfully
>bad, but really should be targeted at specific queries. With those
>reset, everything stays the same except that the hash join turns into a
>nested loop, and instead of a hash on the inner side of the join there's
>an index scan (the join key is indexed on that table).
>

I wonder what the bad plans look like and why this fixes them, but it
seems like a separate issue.

>Because of the modified cpu_*_cost gucs we have two ways around this
>specific query plan: either reset those gucs or set
>enable_parallel_hash = off. But even if it means poor performance, it
>seems to me that we wouldn't want to fail the query. I can confirm
>that it is in fact capable of completing with this plan even with
>work_mem = 150MB. It's slower...but "only" by 7-8x. That timing would
>have been some level of unsurprising in this batch system, and also
>much closer to "normal" for the non-parallel plan we used to have in
>9.6. We were able to see this result on replicas while trying to find
>out exactly how to reproduce the error (it seems sometimes it was
>right under the boundary needed to raise the error).
>

It's certainly true that completing a query is preferable to failing.
It does depend on when we can identify that's the case - the later we
realize the issue, the harder it is to fix. If we notice it during
planning, we can simply disable the hash join, which then forces using a
different join method (and it may actually be quite fast). Once we start
executing, it's way harder.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
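For readers unfamiliar with the term: a toy standalone sketch of the
block nested loop idea mentioned above (purely illustrative - the real
proposal operates on spilled hash join batches and BufFiles, not
in-memory integer arrays):

    /* Toy illustration of a block nested loop: when the inner side of a
     * batch is too big for work_mem, process it one work_mem-sized block
     * at a time, rescanning the outer side once per block. */
    #include <stdio.h>

    #define BLOCK_ROWS 2   /* pretend only 2 inner rows fit in work_mem */

    int
    main(void)
    {
        int inner[] = {1, 2, 2, 3, 5, 5};   /* oversized inner batch */
        int outer[] = {2, 3, 4, 5};         /* outer batch */
        int ninner = 6, nouter = 4;

        for (int b = 0; b < ninner; b += BLOCK_ROWS)  /* one block at a time */
        {
            int end = b + BLOCK_ROWS < ninner ? b + BLOCK_ROWS : ninner;

            /* Build an (imaginary) hash table on inner[b..end), then probe
             * it with a full pass over the outer batch. */
            for (int o = 0; o < nouter; o++)
                for (int i = b; i < end; i++)
                    if (outer[o] == inner[i])
                        printf("match: %d\n", outer[o]);
        }
        return 0;
    }

The point being that an oversized batch never has to fit in memory all
at once; the price is one extra pass over the outer side per block.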
On Sat, Nov 9, 2019 at 1:27 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Sat, Nov 09, 2019 at 08:23:23AM -0500, James Coleman wrote: > >Just realized I accidentally replied only to Tomas. > > > >On Sat, Nov 9, 2019 at 4:55 AM Tomas Vondra > ><tomas.vondra@2ndquadrant.com> wrote: > >> > >> On Fri, Nov 08, 2019 at 09:10:13PM -0500, James Coleman wrote: > >> >On Fri, Nov 8, 2019 at 8:12 PM Thomas Munro <thomas.munro@gmail.com> wrote: > >> >> > >> >> On Sat, Nov 9, 2019 at 1:23 PM James Coleman <jtc331@gmail.com> wrote: > >> >> > On Fri, Nov 8, 2019 at 6:30 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > >> >> > > On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote: > >> >> > > >ERROR: invalid DSA memory alloc request size 1375731712 > >> >> > >> >> > > >#3 0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches > >> >> > > >(hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at > >> >> > >> >> > > I've briefly looked at this today, and I think the root cause is > >> >> > > somewhat similar to what is described in [1] where we simply increase > >> >> > > the number of batches in an effort to keep batch contents in work_mem, > >> >> > > but ignoring that each batch requires quite a bit of memory. So we end > >> >> > > up with a lot of batches where each is small enough to fit into > >> >> > > work_mem, but we need much more than work_mem to track the batches. > >> >> > >> >> Yeah. So even when this is fixed, the query is going to perform > >> >> *terribly*, opening and closing millions of files in random order to > >> >> stream tuples into, if this is case where there really are tuples to > >> >> go to all partitions (and not just a case of extreme skew that our > >> >> extreme skew detector fails to detect because it only detects absolute > >> >> extreme skew). > >> > > >> >work_mem in the repro case is 500MB (the original failure was at > >> >150MB). I realize that's too small for this query, though it's also > >> >worth knowing that if I get rid of some other cluster-wide tunings > >> >that shouldn't have been cluster-wide original (modifications to > >> >cpu_*_cost), the seq scan on a TB+ table feeding the hash turns into > >> >an index scan and no hash (and performs much better). > >> > > >> > >> So is it a case of underestimate? I.e. does the Hash side expect much > >> less data (rows) than it gets during execution? > > > >I've attached a redacted plan, but the estimate with workers planned = > >6 is 986,975,281, which seems quite reasonable to be as the current > >table count is 5,917,539,491. > > > > Hmmm, but the expected row width is only 16B, and with 6M rows that's > only about 90GB. So how come this needs 1TB temporary files? I'm sure > there's a bit of overhead, but 10X seems a bit much. I'm running the query again to confirm this is indeed the one that triggers that, though I'm reasonably confident it is as I was able to reproduce the disk spike on a barely used replica box. I don't suppose there's any way for the row width to be wrong, or not include some other variable (e.g., variable length fields)? The key value here is type citext, and is often 8 chars long, but sometimes much longer (I sampled some that were 34 chars long, for example). I'm not sure the actual distribution of those lengths. > >> >I think this also correlates with us seeing ~TB spike in disk usage, > >> >so your explanation of the lots of "small" files would seem to be > >> >consistent with that. > >> > >> That's consistent with the data. 
500MB and nbatch=2097152 is exactly > >> 1TB, and there'll be some additional overhead. > > > >Good, that helps make sense of that piece of the puzzle. > > > >> >> > > This seems to be about the same problem, except that instead of > >> >> > > forgeting about BufFile, the parallel hash join ignores this: > >> >> > > > >> >> > > pstate->batches = > >> >> > > dsa_allocate0(hashtable->area, > >> >> > > EstimateParallelHashJoinBatch(hashtable) * nbatch); > >> >> > >> >> Yeah, I failed to consider that possibility. I suppose it could be > >> >> avoided with something like this (not tested, I will find a repro for > >> >> this on Monday to convince myself that it's right): > >> >> > >> >> @@ -1246,7 +1246,10 @@ > >> >> ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable) > >> >> } > >> >> > >> >> /* Don't keep growing if it's not > >> >> helping or we'd overflow. */ > >> >> - if (extreme_skew_detected || > >> >> hashtable->nbatch >= INT_MAX / 2) > >> >> + if (extreme_skew_detected || > >> >> + hashtable->nbatch >= INT_MAX / 2 || > >> >> + > >> >> !AllocSizeIsValid(EstimateParallelHashJoinBatch(hashtable) * > >> >> + > >> >> hashtable->nbatch * 2)) > >> >> pstate->growth = PHJ_GROWTH_DISABLED; > >> >> else if (space_exhausted) > >> >> pstate->growth = > >> >> PHJ_GROWTH_NEED_MORE_BATCHES; > >> >> > >> > >> Probably, but that's the execution part of it. I think we should also > >> consider this in ExecChooseHashTableSize, and just don't do PHJ when > >> it exceeds work_mem from the very beginning. > >> > >> Once we start executing we probably can't do much better than what you > >> proposed. In particular it doesn't make sense to limit the space by > >> work_mem, unless we also tweak that limit because then the batch size > >> increases arbitrarily. > >> > >> I think we need do something about this in PG13 - both while planning > >> (considering BufFile and SharedTuplestore), and during execution. The > >> planner part seems fairly simple and independent, and I might have said > >> before I'll look into it. > >> > >> For the executor I think we've agreed the "proper" solution is BNL or > >> something like that. Not sure how far are we from that, though, I > >> don't recall any recent updates (but maybe I just missed that, the > >> pgsql-hackers traffic is pretty insane). I wonder if we should get > >> something like the "rebalancing" I proposed before, which is not a 100% > >> fix but may at least reduce the negative impact. > > > >Do you happen have a link to those discussions? I'd be interested in > >following along. I also can't say I know what BNL stands for. > > > > I think the two main threads are these two: > > 1) accounting for memory used for BufFile during hash joins > https://www.postgresql.org/message-id/20190504003414.bulcbnge3rhwhcsh%40development > > 2) Avoiding hash join batch explosions with extreme skew and weird stats > https://www.postgresql.org/message-id/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com > > The BNL means "block nested loop" which means we split the inner > relation into blocks, and then do nested loop for (some of) them. This > works as a fallback strategy for hash join - when a batch is too large > to fit into work_mem, we can split it into blocks and do a BNL against > the outer relation. Thanks. > >> >> But James's query is still going to be terrible. 
> >> >> > >> >> Do you know if it's extreme skew (many tuples with the same key, just > >> >> a few scattered around in other keys), or simply too much data for > >> >> your work_mem setting? > >> > > >> >Given my description earlier (seq scan on a very large table), I > >> >assume it's likely the latter? If you think that's sufficiently > >> >likely, I'll leave it at that, or if not I could do calculation on > >> >that key to see how distributed it is. > >> > > >> > >> Depends where exactly is the seq scan - is it under the Hash? If yes, > >> how come we even pick hash join in this case? Hard to say without seeing > >> the query plan. > > > >As noted earlier, I've attached a redacted plan here. > > > >The seq scan is under the parallel hash. Note that also as hinted at > >earlier, this happens with cpu_tuple_cost and cpu_operator_cost both > >set to 0.5, which has been a long-standing tweak on this cluster to > >affect some plans that are otherwise painfully bad, but really should > >be targeted at specific queries. With those reset, everything stays > >the same except that the hash join turns into a nested loop, and > >instead of a hash on the inner side of the join there's an index scan > >(the join key is indexed on that table). > > > > I wonder what the bad plans look like and why this fixes them, but it > seems like a separate issue. I am going to look into those further, but it's been configured this way for some time, so it's always a bit messy to find all of these affected queries. > >Because of the modified cpu_*_cost gucs we have two ways around this > >specific query plan: either reset those gucs or set > >enable_parallel_hash = off. But even if it means poor performance, it > >seems to me that we wouldn't want to fail the query. I can confirm > >that it is in fact capable of completing with this plan even with > >work_mem = 150MB. It's slower...but "only" by 7-8x. That timing would > >have been some level of unsurprising in this batch system, and also > >much closer to "normal" for the non-parallel plan we used to have in > >9.6. We were able to see this result on replicas while trying to find > >out exactly how to reproduce the error (it seems sometimes it was > >right under the boundary needed to raise the error). > > > > It's certainly true that completing a query is preferrable to failing. > It does depend when we can identify that's the case - the later we > realize the issue, the harder it's to fix it. If we notice that during > planning, we can simply disable the hash join, which then forces using a > different join method (and it may actually be quite fast). Once we start > executing, it's way harder. In this case though the failure in some sense seems fairly artificial. Aside from the query being slow, there doesn't appear to be any real limitation on the query completing. The box in question has 768GB of memory, so limiting this memory structure to 1GB seems artificial. James
On Sun, Nov 10, 2019 at 9:05 AM James Coleman <jtc331@gmail.com> wrote:
> In this case though the failure in some sense seems fairly artificial.
> Aside from the query being slow, there doesn't appear to be any real
> limitation on the query completing. The box in question has 768GB of
> memory, so limiting this memory structure to 1GB seems artificial.

Yeah, completely artificial. We limit ourselves to MaxAllocSize, a
self-imposed limit that applies to palloc() and dsa_allocate(), probably
dating back to a time in history when anyone asking for more than that
must surely be out of their mind. Then there are a few places where we
allow ourselves to go past that limit by passing in an extra flag
MCXT_ALLOC_HUGE or DSA_ALLOC_HUGE: from a quick grep, that's the bitmap
heap scan page table, and anything using simplehash.h's default
allocator (so I guess that includes hash agg; that's interesting, we can
do a 64GB-bucket-array hash agg but not hash join).

As noted on https://wiki.postgresql.org/wiki/Hash_Join, there are a
couple of artificial constraints on hash joins: the limit on the number
of hash join buckets, which comes entirely from using 'int' as a data
type for bucket numbers (an anachronism from pre-standard C or maybe
just 32 bit-ism: the correct type is surely size_t, which is by
definition big enough for any array that you could possibly address),
and the MaxAllocSize thing. Easy to fix, of course. I noticed that when
working on PHJ and decided to keep the same restriction for the new
parallel code paths, because (1) it seemed like a policy choice we
should make and then apply to both, and (2) it does provide some kind of
sanity check, though it's increasingly looking overdue to be removed (in
other words: previously I was only speculating about people showing up
with ~1TB RAM machines and running into these ancient limits, but ...
here we are).

For example, suppose we added the DSA_ALLOC_HUGE flag to the line that
is failing in your case. Now it would be able to blow through that 1GB
limit, but where would it stop? Until we know how you're reaching this
state, it's hard to know whether it'd go to (say) 2GB, and then work
perfectly, job done, or whether it'd keep going until it ate all your
RAM and made your users really unhappy.

I think this must be a case of extreme skew, as complained about in [1].
Let's see... you have ~6 billion rows, and you said the planner knew
that (it estimated about a billion, when there were 6 workers, so it's
in the ball park). You didn't say how many batches the planner planned
for. Let's see if I can guess... 256 or 512? That'd allow for 6 billion
* 16 byte rows + some slop, chopped up into a power-of-two number of
partitions that fit inside 500MB. Then at execution time, they didn't
fit, and we went into repartition-until-it-fits mode. At some point we
tried to cut them into ~2 million partitions and hit this error. That'd
be a paltry 3,000 rows per partition if the keys were uniformly
distributed, but it thinks that at least one partition is not fitting
into 500MB.

Conclusion: unless this is a carefully crafted hash attack, there must
be one particular key that has more than 500MB worth of rows, but also a
smattering of other keys that fall into the same partition, that are
causing us to keep trying to repartition until it eventually squeezes
all of them in the same direction during a split (this requires
repeatedly splitting partitions until you reach one partition per
tuple!).
So I'm wondering if this would be fixed by, say, a 95% threshold (instead of 100%) for the extreme skew detector, as I proposed in a patch in the first email in that thread that later veered off into the BNL discussion[1]. Unfortunately that patch only deals with non-parallel HJ, but a PHJ version should be easy. And if not by 95%, I wonder what threshold would be needed for this case, and what kind of maths to use to think about it. If I wrote a patch like [1] with PHJ support, would you be able to test it on a copy of your workload? [1] https://www.postgresql.org/message-id/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com
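A quick check of that guess, using only figures already quoted in this
thread (~5.9 billion rows, 16-byte planned width, work_mem = 500MB);
standalone sketch, ignoring per-tuple overhead:

    /* Rough check of the batch-count guess above, using figures from the
     * thread: ~5.9 billion inner rows of ~16 bytes vs work_mem = 500MB. */
    #include <stdio.h>
    #include <stdint.h>

    int
    main(void)
    {
        double   rows = 5917539491.0;           /* current table count */
        double   width = 16.0;                  /* planner's row width */
        double   work_mem = 500.0 * 1024 * 1024;
        double   inner_bytes = rows * width;    /* ~88GB of raw tuple data */
        uint64_t nbatch = 1;

        /* Smallest power-of-two batch count that fits one batch in work_mem. */
        while (inner_bytes / nbatch > work_mem)
            nbatch *= 2;
        printf("planned batches (no slop): %llu\n", (unsigned long long) nbatch);

        /* Where execution gave up, and what that means per partition. */
        printf("rows per partition at nbatch=2097152: %.0f\n", rows / 2097152.0);
        return 0;
    }

That comes out at 256 batches before any slop (so 256 or 512 once
per-tuple overhead is added, matching the guess above) and roughly 2,800
rows per partition at the point of failure - the "paltry 3,000" figure.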
On Sun, Nov 10, 2019 at 7:27 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> Hmmm, but the expected row width is only 16B, and with 6M rows that's
> only about 90GB. So how come this needs 1TB temporary files? I'm sure
> there's a bit of overhead, but 10X seems a bit much.

(s/6M/6B/) Yeah, that comes out to only ~90GB but ... PHJ doesn't
immediately unlink files from the previous generation when it
repartitions. You need at least two generations' worth of files (old and
new) while repartitioning, but you don't need the grand-parent
generation. I didn't think this was a problem because I didn't expect to
have to repartition many times (and there is a similar but different
kind of amplification in the non-parallel code).

If this problem is due to the 100% extreme skew threshold causing us to
go berserk, then that 10X multiplier is of the right order, if you
imagine this thing started out with ~512 batches and got up to ~1M
batches before it blew a gasket.
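Spelling that estimate out (assuming, as a simplification, that each
repartitioning pass rewrites the ~90GB inner side once):

    /* Rough estimate of the temp-file amplification described above,
     * assuming each repartitioning pass rewrites the ~90GB inner side. */
    #include <stdio.h>

    int
    main(void)
    {
        double inner_gb = 90.0;
        int    passes = 0;
        int    nbatch;

        for (nbatch = 512; nbatch < 2097152; nbatch *= 2)
            passes++;                  /* 512 -> 2M batches: 12 doublings */

        printf("passes=%d, ~%.0f GB written\n", passes, passes * inner_gb);
        return 0;
    }

Twelve doublings at ~90GB a pass is on the order of 1TB of temporary
data, which lines up with the disk-usage spike reported earlier.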
On Sat, Nov 9, 2019 at 4:54 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Sun, Nov 10, 2019 at 9:05 AM James Coleman <jtc331@gmail.com> wrote: > > In this case though the failure in some sense seems fairly artificial. > > Aside from the query being slow, there doesn't appear to be any real > > limitation on the query completing. The box in question has 768GB of > > memory, so limiting this memory structure to 1GB seems artificial. > > Yeah, completely artificial. We limit ourselves to MaxAllocSize, a > self-imposed limit that applies to palloc() and dsa_allocate(), > probably dating back to a time in history when anyone asking for more > than that must surely be out of their mind. Then there are a few > places where we allow ourselves to go past that limit by passing in an > extra flag MCXT_ALLOC_HUGE or DSA_ALLOC_HUGE: from a quick grep, > that's the bitmap heap scan page table, and anything using > simplehash.h's default allocator (so I guess that includes hash agg; > that's interesting, we can do a 64GB-bucket-array hash agg but not > hash join). > > As noted on https://wiki.postgresql.org/wiki/Hash_Join, there are a > couple of artificial constraints on hash joins: the limit on the > number of hash join buckets which comes entirely from using 'int' as a > data type for bucket numbers (an anachronism from pre-standard C or > maybe just 32 bit-ism: the correct type is surely size_t, which is by > definition big enough for any array that you could possibly address), > and the MaxAllocSize thing. Easy to fix, of course. I noticed that > when working on PHJ and decided to keep the same restriction for the > new parallel code paths, because (1) it seemed like a policy choice we > should make and then apply to both, and (2) it does provide some kind > of sanity check, though it's increasingly looking overdue to be > removed (in other words: previously I was only speculating about > people showing up with ~1TB RAM machines and running into these > ancient limits, but ... here we are). Heh. We've run with some clusters at 2TB of memory for many years, but I also realize that's not all that common. > For example, suppose we added the DSA_ALLOC_HUGE flag to the line that > is failing in your case. Now it would be able to blow through that > 1GB limit, but where would it stop? Until we know how you're reaching > this state, it's hard to know whether it'd go to (say) 2GB, and then > work perfectly, job done, or whether it'd keep going until it ate all > your RAM and made your users really unhappy. Yeah, I think it's definitely non-obvious exactly what to do. In general I think preferring not failing queries is good, but on the other hand that could result in OOM-killing a bunch of other things too and/or failing other queries because they can't alloc. > I think this must be a case of extreme skew, as complained about > in[1]. Let's see... you have ~6 billion rows, and you said the > planner knew that (it estimated about a billion, when there were 6 > workers, so it's in the ball park). You didn't say how many batches > the planner planned for. Let's see if I can guess... 256 or 512? 
I ran a successful one (this run happened to have max parallel workers per gather set to 8, and used 8 workers) at 150MB work_mem, and the parallel hash node had these stats: Buckets: 4194304 (originally 4194304) Batches: 32768 (originally 4096) Memory Usage: 344448kB Buffers: shared hit=38360 read=142422702, temp read=45766294 written=76912272 I'm not sure how to capture the batch information on a failing run, since you only get it on ANALYZE, but the query doesn't complete so you can't get that. I do notice in the original bug report stack trace though that: #3 0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches (hashtable=hashtable@entry=0x55a7c1db2740, nbatch=nbatch@entry=2097152) at ./build/../src/backend/executor/nodeHash.c:2889 which I think tells us the number of batches? That buffers written is ~615GB which corresponds to one blip on the disk graphs for this run (that was around 600GB) and the HashAggregate node at the top of the query (which is only grouping on items.account_id; see below, but this has significant skew) had: Buffers: shared hit=608075044 read=142422702, temp read=77222547 written=77284432 which corresponds to the second disk graph blip of similar size. Side note about explain output: it seems that the hash aggregate node does not output bucket stats. I've also noticed that the nested loop joins appear to not output the join condition, but hash joins output the hash condition, and if I remember correctly, merge joins output the join condition. If I'm right about that...I guess it means another thing on the list of potential patches. Seems like it'd make understanding plans easier. > That'd allow for 6 billion * 16 byte rows + some slop, chopped up into > a power-of-two number of partitions that fit inside 500MB. Then at > execution time, they didn't fit, and we went into > repartition-until-it-fits mode. At some point we tried to cut them > into ~2 million partitions and hit this error. That'd be a paltry > 3,000 rows per partition if the keys were uniformly distributed, but > it thinks that at least one partition is not fitting into 500MB. > Conclusion: unless this is a carefully crafted hash attack, there must > be one particular key has more than 500MB worth of rows, but also a > smattering of other keys that fall into the same partition, that are > causing us to keep trying to repartition until it eventually squeezes > all of them all in the same direction during a split (this requires > repeatedly splitting partitions until you reach one partition per > tuple!). So I'm wondering if this would be fixed by, say, a 95% > threshold (instead of 100%) for the extreme skew detector, as I > proposed in a patch in the first email in that thread that later > veered off into the BNL discussion[1]. Unfortunately that patch only > deals with non-parallel HJ, but a PHJ version should be easy. And if > not by 95%, I wonder what threshold would be needed for this case, and > what kind of maths to use to think about it. So I should have run the earlier attached plan with VERBOSE, but here's the interesting thing: the parallel hash node's seq scan node outputs two columns: let's call them (from the redacted plan) items.system_id and items.account_id. The first (system_id) is both not null and unique; the second (account_id) definitely has massive skew. I'm not very up-to-speed on how the hash building works, but I would have (perhaps naïvely?)
assumed that the first column being unique would make the hash keys very likely not to collide in any significantly skewed way. Am I missing something here? > If I wrote a patch like > [1] with PHJ support, would you be able to test it on a copy of your > workload? I think that would be a significant effort to pull off, but _maybe_ not entirely impossible. Certainly quite difficult though. > [1] https://www.postgresql.org/message-id/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com
On Sat, Nov 9, 2019 at 6:14 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Sun, Nov 10, 2019 at 7:27 AM Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: > > Hmmm, but the expected row width is only 16B, and with 6M rows that's > > only about 90GB. So how come this needs 1TB temporary files? I'm sure > > there's a bit of overhead, but 10X seems a bit much. > > (s/6M/6B/) Yeah, that comes out to only ~90GB but ... PHJ doesn't > immediately unlink files from the previous generation when it > repartitions. You need at two generations' worth of files (old and > new) while repartitioning, but you don't need the grand-parent > generation. I didn't think this was a problem because I didn't expect > to have to repartition many times (and there is a similar but > different kind of amplification in the non-parallel code). If this > problem is due to the 100% extreme skew threshold causing us to go > berserk, then that 10X multiplier is of the right order, if you > imagine this thing started out with ~512 batches and got up to ~1M > batches before it blew a gasket. Are you saying that it also doesn't unlink the grand-parent until the end?
On Sun, Nov 10, 2019 at 3:25 PM James Coleman <jtc331@gmail.com> wrote: > On Sat, Nov 9, 2019 at 6:14 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Sun, Nov 10, 2019 at 7:27 AM Tomas Vondra > > <tomas.vondra@2ndquadrant.com> wrote: > > > Hmmm, but the expected row width is only 16B, and with 6M rows that's > > > only about 90GB. So how come this needs 1TB temporary files? I'm sure > > > there's a bit of overhead, but 10X seems a bit much. > > > > (s/6M/6B/) Yeah, that comes out to only ~90GB but ... PHJ doesn't > > immediately unlink files from the previous generation when it > > repartitions. You need at two generations' worth of files (old and > > new) while repartitioning, but you don't need the grand-parent > > generation. I didn't think this was a problem because I didn't expect > > to have to repartition many times (and there is a similar but > > different kind of amplification in the non-parallel code). If this > > problem is due to the 100% extreme skew threshold causing us to go > > berserk, then that 10X multiplier is of the right order, if you > > imagine this thing started out with ~512 batches and got up to ~1M > > batches before it blew a gasket. > > Are you saying that it also doesn't unlink the grand-parent until the end? Yeah, it doesn't unlink anything until the end, which obviously isn't great in this case. Might be something to add to my list of patches to write.
On Sun, Nov 10, 2019 at 3:25 PM James Coleman <jtc331@gmail.com> wrote: > So I should have run the earlier attached plan with VERBOSE, but > here's the interesting thing: the parallel hash node's seq scan node > outputs two columns: let's call them (from the redacted plan) > items.system_id and items.account_id. The first (system_id) is both > not null and unique; the second (account_id) definitely has massive > skew. I'm not very up-to-speed on how the hash building works, but I > would have (perhaps naïvely?) assumed that the first column being > unique would make the hash keys very likely not to collide in any > significantly skewed way. Am I missing something here? Hrm. So the compound key is unique then. I was assuming up until now that it had duplicates. The hashes of the individual keys are combined (see ExecHashGetHashValue()), so assuming there is nothing funky about the way citext gets hashed (and it's not jumping out at me), your unique keys should give you uniform hash values and thus partition size, and repartitioning should be an effective way of reducing hash table size. So now it sounds like you have a simple case of underestimation, but now I'm confused about how you got a 344MB hash table with work_mem = 150MB: Buckets: 4194304 (originally 4194304) Batches: 32768 (originally 4096) Memory Usage: 344448kB And I'm confused about what was different when it wanted the crazy number of batches.
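For readers who want to follow the ExecHashGetHashValue() reference, the per-key combination works roughly like the standalone sketch below (a simplified rendering from memory, not the actual nodeHash.c source): the running value is rotated left one bit per join key and XORed with that key's datum hash, so a single unique key such as system_id yields that key's hash unchanged and should partition uniformly.

    #include <stdint.h>

    /*
     * Simplified sketch of how the hashes of the individual join keys are
     * folded into one 32-bit hash: rotate left 1 bit, then XOR in the next
     * key's hash (hash_any() of the citext value in this thread's case).
     */
    uint32_t
    combine_key_hashes(const uint32_t *key_hashes, int nkeys)
    {
        uint32_t hashkey = 0;

        for (int i = 0; i < nkeys; i++)
        {
            hashkey = (hashkey << 1) | ((hashkey & 0x80000000u) ? 1u : 0u);
            hashkey ^= key_hashes[i];
        }
        return hashkey;
    }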
On Sat, Nov 9, 2019 at 10:44 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Sun, Nov 10, 2019 at 3:25 PM James Coleman <jtc331@gmail.com> wrote: > > So I should have run the earlier attached plan with VERBOSE, but > > here's the interesting thing: the parallel hash node's seq scan node > > outputs two columns: let's call them (from the redacted plan) > > items.system_id and items.account_id. The first (system_id) is both > > not null and unique; the second (account_id) definitely has massive > > skew. I'm not very up-to-speed on how the hash building works, but I > > would have (perhaps naïvely?) assumed that the first column being > > unique would make the hash keys very likely not to collide in any > > significantly skewed way. Am I missing something here? > > Hrm. So the compound key is unique then. I was assuming up until now > that it had duplicates. The hashes of the individual keys are > combined (see ExecHashGetHashValue()), so assuming there is nothing > funky about the way citext gets hashed (and it's not jumping out at > me), your unique keys should give you uniform hash values and thus > partition size, and repartitioning should be an effective way of > reducing hash table size. So now it sounds like you have a simple > case of underestimation, but now I'm confused about how you got a > 344MB hash table with work_mem = 150MB: Looking at the source, citext's hash is a pretty standard call to hash_any, so I don't see how that would lead to any oddities (barring an intentional hash collision etc., but this is real data). Do you have any theories about where the underestimation is happening? It knows the number of rows reasonably well. The one thing I'm not sure about yet is the row width = 16. The account_id is a bigint, so we can assume it knows the size properly there. The system_id being citext...I think the 8 bytes it has for that is probably a bit low on average, but I'm not sure yet by how much (I'm going to run a query to find out). Hmm. Query came back, and average length is just shy of 7.9...so the 16 byte row size is looking pretty good. So I'm at a loss as to why/what it would be underestimating (does it know about uniqueness/non-uniqueness? could that be a factor?). I also don't know why it seems to regularly fail on the primary, but not on the sync, unless we adjust the work_mem up. I've double-checked all GUCs and the only differences are things related to replication that you'd expect to be different on primary/replica. I know there are some things that execute differently on replicas, so I assume it's something like that, but I don't know what specifically would cause this here. > Buckets: 4194304 (originally 4194304) Batches: 32768 (originally > 4096) Memory Usage: 344448kB > > And I'm confused about what was different when it wanted the crazy > number of batches. I'm not quite sure how to find out; if you have any ideas, I'd love to hear them. The one thing I can think of to try is to slowly increase work_mem (since setting it to 500MB reproduced the error on the replica) and see if the bucket info starts to trend up. James
On Sun, Nov 10, 2019 at 09:15:06AM -0500, James Coleman wrote: >On Sat, Nov 9, 2019 at 10:44 PM Thomas Munro <thomas.munro@gmail.com> wrote: >> >> On Sun, Nov 10, 2019 at 3:25 PM James Coleman <jtc331@gmail.com> wrote: >> > So I should have run the earlier attached plan with VERBOSE, but >> > here's the interesting thing: the parallel hash node's seq scan node >> > outputs two columns: let's call them (from the redacted plan) >> > items.system_id and items.account_id. The first (system_id) is both >> > not null and unique; the second (account_id) definitely has massive >> > skew. I'm not very up-to-speed on how the hash building works, but I >> > would have (perhaps naïvely?) assumed that the first column being >> > unique would make the hash keys very likely not to collide in any >> > significantly skewed way. Am I missing something here? >> >> Hrm. So the compound key is unique then. I was assuming up until now >> that it had duplicates. The hashes of the individual keys are >> combined (see ExecHashGetHashValue()), so assuming there is nothing >> funky about the way citext gets hashed (and it's not jumping out at >> me), your unique keys should give you uniform hash values and thus >> partition size, and repartitioning should be an effective way of >> reducing hash table size. So now it sounds like you have a simple >> case of underestimation, but now I'm confused about how you got a >> 344MB hash table with work_mem = 150MB:

Why would the compound key matter? The join only does this

    Hash Cond: (group_details.group_number = items.system_id)

so the only thing that really matters when it comes to skew is system_id, i.e. the citext column. The other column is used by aggregation, but that's irrelevant here, no?

>Looking at the source, citext's hash is a pretty standard call to >hash_any, so I don't see how that would lead to any oddities (barring >an intentional hash collision etc., but this is real data). > >Do you have any theories about where the underestimation is happening? >It knows the number of rows reasonably well. The one thing I'm not >sure about yet is the row width = 16. The account_id is a bigint, so >we can assume it knows the size properly there. The system_id being >citext...I think the 8 bytes it has for that is probably a bit low on >average, but I'm not sure yet by how much (I'm going to run a query to >find out). Hmm. Query came back, and average length is just shy of >7.9...so the 16 byte row size is looking pretty good. So I'm at a loss >of why/what it would be underestimating (does it know about >uniqueness/non-uniqueness? could that be a factor?)?

I think the easiest thing we can do is to run a couple of queries collecting useful stats, like

    -- the most frequent system_id values
    select system_id, count(*), sum(length(system_id)) from items
    group by system_id
    order by count(*) desc limit 100;

    -- the largest group by system_id
    select system_id, count(*), sum(length(system_id)) from items
    group by system_id
    order by sum(length(system_id)) desc limit 100;

That should tell us if there's something off.

>I also don't know why it seems to regularly fail on the primary, but >not on the sync, unless we adjust the work_mem up. I've double-checked >all GUCs and the only differences are things related to replication >that you'd expect to be different on primary/replica. > >I know there are some things that execute differently on replicas, so >I assume it's something like that, but I don't know what specifically >would cause this here.
Hmm, I hadn't realized it fails on the primary more often. That's certainly strange; I don't have a very good idea why that could be happening. Do the query plans look the same? What does the EXPLAIN ANALYZE output look like? >> Buckets: 4194304 (originally 4194304) Batches: 32768 (originally >> 4096) Memory Usage: 344448kB >> >> And I'm confused about what was different when it wanted the crazy >> number of batches. > >I'm not quite sure how to find out; if you have any ideas, I'd love to >hear them. The one thing I can think of to try is to slowly increase >work_mem (since setting it to 500MB reproduced the error on the >replica) and see if the bucket info starts to trend up.

Chances are it actually happened to detect skew and disabled the growth, hence the table grew above work_mem. We should probably add info about this (skew, disabling growth) to EXPLAIN ANALYZE verbose mode. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Nov 10, 2019 at 12:29 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Sun, Nov 10, 2019 at 09:15:06AM -0500, James Coleman wrote: > >On Sat, Nov 9, 2019 at 10:44 PM Thomas Munro <thomas.munro@gmail.com> wrote: > >> > >> On Sun, Nov 10, 2019 at 3:25 PM James Coleman <jtc331@gmail.com> wrote: > >> > So I should have run the earlier attached plan with VERBOSE, but > >> > here's the interesting thing: the parallel hash node's seq scan node > >> > outputs two columns: let's call them (from the redacted plan) > >> > items.system_id and items.account_id. The first (system_id) is both > >> > not null and unique; the second (account_id) definitely has massive > >> > skew. I'm not very up-to-speed on how the hash building works, but I > >> > would have (perhaps naïvely?) assumed that the first column being > >> > unique would make the hash keys very likely not to collide in any > >> > significantly skewed way. Am I missing something here? > >> > >> Hrm. So the compound key is unique then. I was assuming up until now > >> that it had duplicates. The hashes of the individual keys are > >> combined (see ExecHashGetHashValue()), so assuming there is nothing > >> funky about the way citext gets hashed (and it's not jumping out at > >> me), your unique keys should give you uniform hash values and thus > >> partition size, and repartitioning should be an effective way of > >> reducing hash table size. So now it sounds like you have a simple > >> case of underestimation, but now I'm confused about how you got a > >> 344MB hash table with work_mem = 150MB: > > > > Why would the compound key matter? The join only does this > > Hash Cond: (group_details.group_number = items.system_id) > > so the only thing that really matters when it comes to skew is > system_id, i.e. the citext column. The other column is used by > aggregation, but that's irrelevant here, no? Yeah, that seems right. I think I got confused because account_id is used in another join (well, technically a correlated subquery, but effectively a join in implementation). But even so, system_id is unique and not null. So the above discussion of how uniqueness affects this still holds, right? > >Looking at the source, citext's hash is a pretty standard call to > >hash_any, so I don't see how that would lead to any oddities (barring > >an intentional hash collision etc., but this is real data). > > > >Do you have any theories about where the underestimation is happening? > >It knows the number of rows reasonably well. The one thing I'm not > >sure about yet is the row width = 16. The account_id is a bigint, so > >we can assume it knows the size properly there. The system_id being > >citext...I think the 8 bytes it has for that is probably a bit low on > >average, but I'm not sure yet by how much (I'm going to run a query to > >find out). Hmm. Query came back, and average length is just shy of > >7.9...so the 16 byte row size is looking pretty good. So I'm at a loss > >of why/what it would be underestimating (does it know about > >uniqueness/non-uniqueness? could that be a factor?)?
> > > > I think the easiest thing we can do is running a couple of queries > collecting useful stats, like > > -- the most frequent system_id values > select system_id, count(*), sum(length(system_id)) from items > group by system_id > order by count(*) desc limit 100; > > -- the largest group by system_id > select system_id, count(*), sum(length(system_id)) from items > group by system_id > order by sum(length(system_id)) desc limit 100; > > That should tell us if there's something off. Since the column is unique, we don't need to do any grouping to get stats; as noted earlier I queried and found that the average length was just shy of 7.9 chars long (though I happen to know the length varies at least between 7 and 34). > >I also don't know why it seems to regularly fail on the primary, but > >not on the sync, unless we adjust the work_mem up. I've double-checked > >all GUCs and the only differences are things related to replication > >that you'd expect to be different on primary/replica. > > > >I know there are some things that execute differently on replicas, so > >I assume it's something like that, but I don't know what specifically > >would cause this here. > > > > Hmm, I haven't realized it fails on primary more often. That's certainly > strange, I don't have a very good idea why that could be happening. Do > the query plans look the same? How does the EXPLAIN ANALYZE look like? So far they all seem to look very similar. The other possibility is that it's just that the replica is behind (it's not a sync replica), but this has happened enough times on different days that that explanation feels pretty unsatisfactory to me. As mentioned in my previous email, I'm slowly checking against various work_mem values to see if I can determine 1.) where it starts to only fail on the primary and 2.) if the bucket stats start to grow before the failure case. So far I've seen it fail on both primary and sync at 500MB, 400MB, 300MB, and 200MB, and, interestingly, with today's data, succeed on both at 150MB. That last point does imply data changing could have something to do with the sync/replica difference I suppose. ...And, it just failed on the replica at 150MB and succeeded on the primary at 150MB. So...probably data changing? In all of those runs I haven't seen the bucket stats change. > >> Buckets: 4194304 (originally 4194304) Batches: 32768 (originally > >> 4096) Memory Usage: 344448kB > >> > >> And I'm confused about what was different when it wanted the crazy > >> number of batches. > > > >I'm not quite sure how to find out; if you have any ideas, I'd love to > >hear them. The one thing I can think of to try is to slowly increase > >work_mem (since setting it to 500MB reproduced the error on the > >replica) and see if the bucket info starts to trend up. > > > > Chances are it actually happened to detect skew and disabled the growth, > hence the table grew above work_mem. We should probably add info about > this (skew, disabling growth) to EXPLAIN ANALYZE verbose mode. Would skew be likely though if the column is unique?
I think I see what's happening: we're running out of hash bits. > Buckets: 4194304 (originally 4194304) Batches: 32768 (originally 4096) Memory Usage: 344448kB Here it's using the lower 22 bits for the bucket number, and started out using 12 bits for the batch (!), and increased that until it got to 15 (!!). After using 22 bits for the bucket, there are only 10 bits left, so all the tuples go into the lower 1024 batches. I'm not sure how exactly this leads to wildly varying numbers of repartitioning cycles (the above-quoted example did it 3 times, the version that crashed into MaxAllocSize did it ~10 times). Besides switching to 64 bit hashes so that we don't run out of information (clearly a good idea), what other options do we have? (1) We could disable repartitioning (give up on work_mem) after we've run out of bits; this will eat more memory than it should. (2) You could start stealing bucket bits; this will eat more CPU than it should, because you'd effectively have fewer active buckets (tuples will be concentrated on the bits you didn't steal).
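To make the bit budget concrete, here is a small standalone illustration (not PostgreSQL source; the real split happens in ExecHashGetBucketAndBatch()) using the figures quoted above: with 4194304 buckets the low 22 bits are consumed by the bucket number, and because the unpatched code takes the batch number from the bits directly above them, only 10 usable bits remain and no tuple ever lands in a batch above 1023, even though nbatch is 32768.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>

    int
    main(void)
    {
        const uint32_t nbatch = 32768;          /* 2^15, from the EXPLAIN above */
        const int      log2_nbuckets = 22;      /* 4194304 buckets */
        uint32_t       max_batchno = 0;

        for (int i = 0; i < 1000000; i++)
        {
            /* stand-in for a well-mixed 32-bit hash value */
            uint32_t hashvalue = (uint32_t) random() ^ ((uint32_t) random() << 16);
            uint32_t batchno = (hashvalue >> log2_nbuckets) & (nbatch - 1);

            if (batchno > max_batchno)
                max_batchno = batchno;
        }

        /* prints 1023: only 10 of the 15 requested batch bits actually vary */
        printf("highest batch number seen: %u (nbatch = %u)\n",
               max_batchno, nbatch);
        return 0;
    }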
On Mon, Nov 11, 2019 at 10:08:58AM +1300, Thomas Munro wrote: >I think I see what's happening: we're running out of hash bits. > >> Buckets: 4194304 (originally 4194304) Batches: 32768 (originally 4096) Memory Usage: 344448kB > >Here it's using the lower 22 bits for the bucket number, and started >out using 12 bits for the batch (!), and increased that until it got >to 15 (!!). After using 22 bits for the bucket, there are only 10 >bits left, so all the tuples go into the lower 1024 batches. > Ouch! >I'm not sure how exactly this leads to wildly varying numbers of >repartioning cycles (the above-quoted example did it 3 times, the >version that crashed into MaxAllocSize did it ~10 times). > >Besides switching to 64 bit hashes so that we don't run out of >information (clearly a good idea), what other options do we have? (1) >We could disable repartitioning (give up on work_mem) after we've run >out of bits; this will eat more memory than it should. (2) You could >start stealing bucket bits; this will eat more CPU than it should, >because you'd effectively have fewer active buckets (tuples will >concentrated on the bits you didn't steal). Can't we simply compute two hash values, using different seeds - one for bucket and the other for batch? Of course, that'll be more expensive. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thomas Munro <thomas.munro@gmail.com> writes: > I think I see what's happening: we're running out of hash bits. Ugh. > Besides switching to 64 bit hashes so that we don't run out of > information (clearly a good idea), what other options do we have? If the input column is less than 64 bits wide, this is an illusory solution, because there can't really be that many independent hash bits. I think we need to have some constraint that keeps us from widening the batch+bucket counts to more hash bits than we have. regards, tom lane
On Sun, Nov 10, 2019 at 4:09 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > I think I see what's happening: we're running out of hash bits. > > > Buckets: 4194304 (originally 4194304) Batches: 32768 (originally 4096) Memory Usage: 344448kB > > Here it's using the lower 22 bits for the bucket number, and started > out using 12 bits for the batch (!), and increased that until it got > to 15 (!!). After using 22 bits for the bucket, there are only 10 > bits left, so all the tuples go into the lower 1024 batches. > > I'm not sure how exactly this leads to wildly varying numbers of > repartioning cycles (the above-quoted example did it 3 times, the > version that crashed into MaxAllocSize did it ~10 times). > > Besides switching to 64 bit hashes so that we don't run out of > information (clearly a good idea), what other options do we have? (1) > We could disable repartitioning (give up on work_mem) after we've run > out of bits; this will eat more memory than it should. (2) You could > start stealing bucket bits; this will eat more CPU than it should, > because you'd effectively have fewer active buckets (tuples will > concentrated on the bits you didn't steal). Thanks for working through this! But now we're at the end of my understanding of how hash tables and joins are implemented in PG; is there a wiki page or design that might give me some current design description of how the buckets and batches work with the hash so I can keep following along? James
On Sun, Nov 10, 2019 at 10:23:52PM +0100, Tomas Vondra wrote: >On Mon, Nov 11, 2019 at 10:08:58AM +1300, Thomas Munro wrote: >>I think I see what's happening: we're running out of hash bits. >> >>>Buckets: 4194304 (originally 4194304) Batches: 32768 (originally 4096) Memory Usage: 344448kB >> >>Here it's using the lower 22 bits for the bucket number, and started >>out using 12 bits for the batch (!), and increased that until it got >>to 15 (!!). After using 22 bits for the bucket, there are only 10 >>bits left, so all the tuples go into the lower 1024 batches. >> > >Ouch! > >>I'm not sure how exactly this leads to wildly varying numbers of >>repartioning cycles (the above-quoted example did it 3 times, the >>version that crashed into MaxAllocSize did it ~10 times). >> >>Besides switching to 64 bit hashes so that we don't run out of >>information (clearly a good idea), what other options do we have? (1) >>We could disable repartitioning (give up on work_mem) after we've run >>out of bits; this will eat more memory than it should. (2) You could >>start stealing bucket bits; this will eat more CPU than it should, >>because you'd effectively have fewer active buckets (tuples will >>concentrated on the bits you didn't steal). > >Can't we simply compute two hash values, using different seeds - one for >bucket and the other for batch? Of course, that'll be more expensive. Meh, I realized that's pretty much just a different way to get 64-bit hashes (which is what you mentioned). And I think you're right that we should detect cases where we use all the bits, and stop making it worse. Stealing bucket bits seems reasonable - we can't stop adding batches, because that would mean we stop enforcing work_mem. The question is how far that gets us - we enforce nbuckets to be at least 1024 (i.e. 10 bits), which leaves 22 bits for nbatch. And in one of the explains we've seen nbatch=2097152 (i.e. 21 bits). Of course, this is bound to be (extremely) slow. The hashjoin changes in 9.5 which reduced the hash table load factor from 10 to 1 resulted in speedups close to 3x. And if you go from 4194304 to 1024 buckets, those will be loooooooong chains in each bucket :-( regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Nov 10, 2019 at 04:34:09PM -0500, Tom Lane wrote: >Thomas Munro <thomas.munro@gmail.com> writes: >> I think I see what's happening: we're running out of hash bits. > >Ugh. > >> Besides switching to 64 bit hashes so that we don't run out of >> information (clearly a good idea), what other options do we have? > >If the input column is less than 64 bits wide, this is an illusory >solution, because there can't really be that many independent hash bits. >I think we need to have some constraint that keeps us from widening >the batch+bucket counts to more hash bits than we have. > That's not quite correct - it's a solution/improvement as long as the values have more than 32 bits of information. I.e. it works even for values that are less than 64 bits wide, as long as there's more than 32 bits of information (assuming the hash does not lose the extra bits). In the case discussed in this thread we're dealing with a citext column, with average length ~8 chars. So it probably has less than 64 bits of information, because it's likely just alphanumeric, or something like that. But likely more than 32 bits. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
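A quick back-of-the-envelope check of that claim, purely for illustration; it assumes the ~7.9-character system_id values draw from a case-insensitive alphanumeric alphabet of 36 symbols, which is an assumption and not something the thread confirms:

    #include <stdio.h>
    #include <math.h>

    int
    main(void)
    {
        double alphabet = 36.0;     /* assumed: case-insensitive letters + digits */
        double avg_len = 7.9;       /* average length reported upthread */

        /* ~40.8 bits: more than 32, well short of 64, matching the argument */
        printf("approx. information content: %.1f bits\n",
               avg_len * log2(alphabet));
        return 0;
    }

(Build with -lm; the exact number obviously shifts with the real alphabet and length distribution.)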
On Mon, Nov 11, 2019 at 10:38 AM James Coleman <jtc331@gmail.com> wrote: > But now we're at the end of my understanding of how hash tables and > joins are implemented in PG; is there a wiki page or design that might > give me some current design description of how the buckets and batches > work with the hash so I can keep following along? We have something like the so-called "Grace" hash join (with the "hybrid" refinement, irrelevant for this discussion): https://en.wikipedia.org/wiki/Hash_join#Grace_hash_join Our word "batch" just means partition. Most descriptions talk about using two different hash functions for partition and bucket, but our implementation uses a single hash function, and takes some of the bits to choose the bucket and some of the bits to choose the batch. That happens here: https://github.com/postgres/postgres/blob/master/src/backend/executor/nodeHash.c#L1872
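The function linked above boils down to something like the following sketch (simplified, not the verbatim source): both indexes come from one 32-bit hash, the bucket from the low log2(nbuckets) bits and the batch from the bits immediately above them, which is why the two bit ranges run into each other once nbuckets * nbatch needs more than 32 bits.

    #include <stdint.h>

    /*
     * Simplified sketch of the unpatched bucket/batch split: nbuckets and
     * nbatch are both powers of two.
     */
    void
    get_bucket_and_batch(uint32_t hashvalue,
                         uint32_t nbuckets, int log2_nbuckets, uint32_t nbatch,
                         uint32_t *bucketno, uint32_t *batchno)
    {
        *bucketno = hashvalue & (nbuckets - 1);     /* low bits */
        if (nbatch > 1)
            *batchno = (hashvalue >> log2_nbuckets) & (nbatch - 1);   /* next bits up */
        else
            *batchno = 0;
    }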
Hi, On 2019-11-10 22:50:17 +0100, Tomas Vondra wrote: > On Sun, Nov 10, 2019 at 10:23:52PM +0100, Tomas Vondra wrote: > > On Mon, Nov 11, 2019 at 10:08:58AM +1300, Thomas Munro wrote: > > Can't we simply compute two hash values, using different seeds - one for > > bucket and the other for batch? Of course, that'll be more expensive. > > Meh, I realized that's pretty much just a different way to get 64-bit > hashes (which is what you mentioned). I'm not sure it's really the same, given practical realities in postgres. Right now the "extended" hash function supporting 64 bit hash functions is optional. So we couldn't unconditionally rely on it being present, even in master, unless we're prepared to declare it as required from now on. So computing two different hash values at the same time, by using a different IV and a different combine function, doesn't seem like an unreasonable approach. Greetings, Andres Freund
On Sun, Nov 10, 2019 at 02:46:31PM -0800, Andres Freund wrote: >Hi, > >On 2019-11-10 22:50:17 +0100, Tomas Vondra wrote: >> On Sun, Nov 10, 2019 at 10:23:52PM +0100, Tomas Vondra wrote: >> > On Mon, Nov 11, 2019 at 10:08:58AM +1300, Thomas Munro wrote: >> > Can't we simply compute two hash values, using different seeds - one for >> > bucket and the other for batch? Of course, that'll be more expensive. >> >> Meh, I realized that's pretty much just a different way to get 64-bit >> hashes (which is what you mentioned). > >I'm not sure it's really the same, given practical realities in >postgres. Right now the "extended" hash function supporting 64 bit hash >functions is optional. So we couldn't unconditionally rely on it being >present, even in master, unless we're prepared to declare it as >required from now on. > >So computing two different hash values at the same time, by using a >different IV and a different combine function, doesn't seem like an >unreasonable approach. > True. I was commenting on the theoretical fact that computing two 32-bit hashes is close to computing a 64-bit hash, but you're right there are implementation details that may make it more usable in our case. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Nov 11, 2019 at 12:44 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On Sun, Nov 10, 2019 at 02:46:31PM -0800, Andres Freund wrote: > >On 2019-11-10 22:50:17 +0100, Tomas Vondra wrote: > >> On Sun, Nov 10, 2019 at 10:23:52PM +0100, Tomas Vondra wrote: > >> > On Mon, Nov 11, 2019 at 10:08:58AM +1300, Thomas Munro wrote: > >> > Can't we simply compute two hash values, using different seeds - one for > >> > bucket and the other for batch? Of course, that'll be more expensive. > >> > >> Meh, I realized that's pretty much just a different way to get 64-bit > >> hashes (which is what you mentioned). > > > >I'm not sure it's really the same, given practical realities in > >postgres. Right now the "extended" hash function supporting 64 bit hash > >functions is optional. So we couldn't unconditionally rely on it being > >present, even in master, unless we're prepared to declare it as > >required from now on. > > > >So computing two different hash values at the same time, by using a > >different IV and a different combine function, doesn't seem like an > >unreasonable approach. > > True. I was commenting on the theoretical fact that computing two 32-bit > hashes is close to computing a 64-bit hash, but you're right there are > implementation details that may make it more usable in our case. Here is a quick sketch of something like that, for discussion only. I figured that simply mixing the hash value we have with some arbitrary bits afterwards would be just as good as having started with a different IV, which leads to a very simple change without refactoring. From quick experiments with unique keys (generate_series) I seem to get approximately even sized partitions, and correct answers, but I make no claim to strong hash-math-fu and haven't tested on very large inputs. Thoughts?
Attachment
On Mon, Nov 11, 2019 at 01:33:41PM +1300, Thomas Munro wrote: >On Mon, Nov 11, 2019 at 12:44 PM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> On Sun, Nov 10, 2019 at 02:46:31PM -0800, Andres Freund wrote: >> >On 2019-11-10 22:50:17 +0100, Tomas Vondra wrote: >> >> On Sun, Nov 10, 2019 at 10:23:52PM +0100, Tomas Vondra wrote: >> >> > On Mon, Nov 11, 2019 at 10:08:58AM +1300, Thomas Munro wrote: >> >> > Can't we simply compute two hash values, using different seeds - one for >> >> > bucket and the other for batch? Of course, that'll be more expensive. >> >> >> >> Meh, I realized that's pretty much just a different way to get 64-bit >> >> hashes (which is what you mentioned). >> > >> >I'm not sure it's really the same, given practical realities in >> >postgres. Right now the "extended" hash function supporting 64 bit hash >> >functions is optional. So we couldn't unconditionally rely on it being >> >present, even in master, unless we're prepared to declare it as >> >required from now on. >> > >> >So computing two different hash values at the same time, by using a >> >different IV and a different combine function, doesn't seem like an >> >unreasonable approach. >> >> True. I was commenting on the theoretical fact that computing two 32-bit >> hashes is close to computing a 64-bit hash, but you're right there are >> implementation details that may make it more usable in our case. > >Here is a quick sketch of something like that, for discussion only. I >figured that simply mixing the hash value we have with some arbitrary >bits afterwards would be just as good as having started with a >different IV, which leads to a very simple change without refactoring. >From quick experiments with unique keys (generate_series) I seem to >get approximately even sized partitions, and correct answers, but I >make no claim to strong hash-math-fu and haven't tested on very large >inputs. Thoughts? Hmmm, interesting. Using a hard-coded constant seems a bit weird, although I don't have any precise argument why it's wrong yet. Essentially, the patch does this:

    batchno = hash_combine(0x9e3779b9, hashvalue) & (nbatch - 1);
    bucketno = hashvalue & (nbuckets - 1);

but adding a constant does not 'propagate' any bits from the original value to the hash. So it seems pretty much equivalent to

    batchno = hashvalue & (nbatch - 1);
    bucketno = hashvalue & (nbuckets - 1);

so we still operate with the same number of bits. It might solve this particular case, but I doubt it's a good idea in general ... I think we will have to actually compute two hashes, to get more than 32 bits. But yes, that'll be more invasive, and unlikely to be backpatchable. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
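For reference when weighing that argument, the 32-bit hash_combine() being discussed is the boost-style mixer below (reproduced from memory, so treat it as an approximation of src/include/utils/hashutils.h). Unlike a plain mask it does fold shifted copies of the accumulator into the result, but it cannot manufacture information beyond the 32 bits it is given.

    #include <stdint.h>

    /* Boost-style 32-bit hash combiner, as described in the discussion above. */
    static inline uint32_t
    hash_combine32(uint32_t a, uint32_t b)
    {
        a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);
        return a;
    }

    /*
     * The draft patch's scheme, per the quote above (illustrative rendering):
     *
     *   batchno  = hash_combine32(seed, hashvalue) & (nbatch - 1);
     *   bucketno = hashvalue & (nbuckets - 1);
     */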
On Sun, Nov 10, 2019 at 4:09 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > I think I see what's happening: we're running out of hash bits. > > > Buckets: 4194304 (originally 4194304) Batches: 32768 (originally 4096) Memory Usage: 344448kB > > Here it's using the lower 22 bits for the bucket number, and started > out using 12 bits for the batch (!), and increased that until it got > to 15 (!!). After using 22 bits for the bucket, there are only 10 > bits left, so all the tuples go into the lower 1024 batches. Do we have this kind of problem with hash aggregates also? I've noticed the temp disk usage pattern applies to both, and the buffer stats show that to be the case, but unfortunately the hash aggregate node doesn't report memory usage for its hash or buckets info. Given it's not a join, maybe we only need buckets and not batches, but I don't know this part of the code at all, so I'm just guessing either way. James
On Mon, Nov 11, 2019 at 09:14:43AM -0500, James Coleman wrote: >On Sun, Nov 10, 2019 at 4:09 PM Thomas Munro <thomas.munro@gmail.com> wrote: >> >> I think I see what's happening: we're running out of hash bits. >> >> > Buckets: 4194304 (originally 4194304) Batches: 32768 (originally 4096) Memory Usage: 344448kB >> >> Here it's using the lower 22 bits for the bucket number, and started >> out using 12 bits for the batch (!), and increased that until it got >> to 15 (!!). After using 22 bits for the bucket, there are only 10 >> bits left, so all the tuples go into the lower 1024 batches. > >Do we have this kind of problem with hash aggregates also? I've >noticed the temp disk usage pattern applies to both, and the buffers >stats shows that being the case, but unfortunately the hash aggregate >node doesn't report memory usage for its hash or buckets info. Given >it's not a join, maybe we only need buckets and not batches, but I >don't know this part of the code at all, so I'm just guessing to >assume either way. > I don't think so, because hash aggregate does not spill to disk. The trouble with hashjoins is that we need two independent indexes - bucket and batch, and we only have a single 32-bit hash value. The hash agg is currently unable to spill to disk, so it's not affected by this (but it does have plenty of issues on its own). There's work in progress aiming to add memory-bounded hash aggregate, but I think the spilling is supposed to work very differently there. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Nov 11, 2019 at 11:46 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > Essentially, the patch does this: > > batchno = hash_combine(0x9e3779b9, hashvalue) & (nbatch - 1); > bucketno = hashvalue & (nbuckets - 1); > > but adding a constant does not 'propagate' any bits from the original > value to the hash. So it seems pretty much equivalent to > > batchno = hashvalue & (nbatch - 1); > bucketno = hashvalue & (nbuckets - 1); > > so we still operate with the same number of bits. It might solve this > particular case, but I doubt it's a good idea in general ... Yeah, I think you're generally right, but it's much better than the specific coding batchno = hashvalue & (nbatch - 1). With that recipe, batch #42 finishes up with all its tuples crammed into bucket #42. That's the worst case of the steal-the-bucket-bits idea mentioned earlier. After sleeping on it, I think the re-hash idea is equivalent to stealing the minimum number of bucket bits. A simple but subtly broken version of that would be: take bucket number from the low end, and batch number from the high end, and don't worry if they begin to overlap because it's the best we can do: batchno = (hashvalue >> (hash_width - nbatch_width)) & (nbatch - 1); bucketno = hashvalue & (nbuckets - 1); Tuples start concentrating in a subset of buckets when the bit ranges begin to overlap, due to a correlation between batch and bucket, but that seems to be true for the re-hash idea too. You can't actually use that exact coding, though, because of the way we increase nbatch in serial hash joins: we're only allowed to add new bits to the high end of the generated batch numbers so that every tuple either stays in the current batch or is "thrown forward" to a higher-numbered batch when nbatch increases. So instead you could write it as: batchno = reverse_bits(hashvalue) & (nbatch - 1); bucketno = hashvalue & (nbuckets - 1); I'm not sure if reversing bits can be done any more efficiently than the cheap-and-cheerful hash_combine() code, though. With the right inlining it seems like nice straight-line no-jmp no-lookup-table code, so maybe it's still worth considering if it has equivalent performance for our purposes. A tangential observation is that with any of these approaches you can probably lift the current restriction on increasing nbucket when nbatch > 1 (that prohibition is based on not wanting overlapping bucket and batch bits ranges, but if we're deciding that's OK in extreme cases and taking steps to minimise it in usual cases...) > I think we will have to actually compute two hashes, to get more than 32 > bits. But yes, that'll be more invasive, and unlikely backpatchable. Agreed. I think if you started with a different IV *inside the hash function*, in other words, if you effectively had two different hash functions (which is the way Grace is usually described) so that the two hash function each get to see (potentially) more than 32 bits' worth of input, it'd be about the same as having a single 64 bit hash and using non-overlapping bit ranges for batch and bucket. Nothing like that seems plausible for back branches.
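An illustrative bit-reversal routine for the scheme sketched above (a standard swap-based reversal; the reverse_bits_u32() in the draft patch may differ in detail):

    #include <stdint.h>

    /* Reverse the bits of a 32-bit word by swapping progressively larger groups. */
    static inline uint32_t
    reverse_bits_u32(uint32_t x)
    {
        x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);  /* adjacent bits */
        x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);  /* bit pairs */
        x = ((x & 0x0f0f0f0fu) << 4) | ((x >> 4) & 0x0f0f0f0fu);  /* nibbles */
        x = (x << 24) | ((x & 0xff00u) << 8) |
            ((x >> 8) & 0xff00u) | (x >> 24);                     /* bytes */
        return x;
    }

    /*
     * With it, the "take batch bits from the other end" computation above is:
     *
     *   batchno  = reverse_bits_u32(hashvalue) & (nbatch - 1);
     *   bucketno = hashvalue & (nbuckets - 1);
     */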
Hi, On 2019-11-11 11:46:05 +0100, Tomas Vondra wrote: > On Mon, Nov 11, 2019 at 01:33:41PM +1300, Thomas Munro wrote: > Hmmm, iteresting. Using a hard-coded constant seems a bit weird, > although I don't have any precise argument why it's wrong yet. > > Essentially, the patch does this: > > batchno = hash_combine(0x9e3779b9, hashvalue) & (nbatch - 1); > bucketno = hashvalue & (nbuckets - 1); > > but adding a constant does not 'propagate' any bits from the original > value to the hash. So it seems pretty much equivalent to I was over IM arguing that we ought to also use a different mixing function when computing the hash. > batchno = hashvalue & (nbatch - 1); > bucketno = hashvalue & (nbuckets - 1); > so we still operate with the same number of bits. It might solve this > particular case, but I doubt it's a good idea in general ... > > I think we will have to actually compute two hashes, to get more than 32 > bits. But yes, that'll be more invasive, and unlikely backpatchable. I think it might also just generally not be acceptable, from a performance POV. Computing the hashes is already a major bottleneck for HJ. Penalizing everybody for these extreme cases doesn't strike me as great. It might be ok to compute a 64bit hash, but I don't see how computing two hashes in parallel wouldn't have significant overhead. I don't really see what the problem is with using just a differently mixed hash. There's not really a requirement for there to be a tremendous amount of independence between the bits used for the bucket, and the bits for the batch. We cannot just reuse exactly the same bits for the bucket that we used for the hash, as otherwise we'd just raise the number of bucket conflicts (as the whole hashtable will only contain tuples with the reduced number of bits). But I don't see why that implies a need for full bit independence. As long as the mixing guarantees that we use the bits for bucketing differently than the ones for batches, I think we're good. All that's needed is that each batch is roughly equally sized, and contains tuples with hashes that are roughly equally distributed - you could bitreverse the hash for batch selection, and get that, no? Greetings, Andres Freund
On Thu, Nov 14, 2019 at 7:48 AM Andres Freund <andres@anarazel.de> wrote: > I don't really see what the problem is with using just a differently > mixed hash. There's not really a requirement for there to be a > tremendous amount of independence between the bits used for the bucket, > and the bits for the batch. We cannot just reuse exactly the same bits > for the bucket that we used for the hash, as otherwise we'd just raise > the number of bucket conflicts (as the whole hashtable will only contain > tuples with the reduced number of bits). But I don't see why that > implies a need for full bit indepence. As long as the mixing guarantees > that we use the bits for bucketing differently than the ones for > batches, I think we're good. All that need is that each batch is is > roughly equally sized, and contains tuples with hashes that roughly > equally distributed - you could bitreverse the hash for batch selection, > and get that, no? Right, that's what I said earlier: > batchno = reverse_bits(hashvalue) & (nbatch - 1); > bucketno = hashvalue & (nbuckets - 1); My theory is that batchno = hash_combine(<constant>, hashvalue) & (nbatch - 1) has about the same distribution properties. That is, it works nicely up until you have more than about 2^32 keys, and then it begins to break down due to lack of information, but the break down manifests as more conflicts in buckets, instead of the failure to repartition once you start reading off the end of the hash value (the cause of the present bug report). The bit reverse approach might be a bit less mysterious, but when I looked at ways to implement that they looked potentially slower than the xor, shifts and adds used by hash_combine. I didn't attempt to test that, though.
On Thu, Nov 14, 2019 at 9:20 AM Thomas Munro <thomas.munro@gmail.com> wrote: > On Thu, Nov 14, 2019 at 7:48 AM Andres Freund <andres@anarazel.de> wrote: > > I don't really see what the problem is with using just a differently > > mixed hash. There's not really a requirement for there to be a > > tremendous amount of independence between the bits used for the bucket, > > and the bits for the batch. We cannot just reuse exactly the same bits > > for the bucket that we used for the hash, as otherwise we'd just raise > > the number of bucket conflicts (as the whole hashtable will only contain > > tuples with the reduced number of bits). But I don't see why that > > implies a need for full bit indepence. As long as the mixing guarantees > > that we use the bits for bucketing differently than the ones for > > batches, I think we're good. All that need is that each batch is is > > roughly equally sized, and contains tuples with hashes that roughly > > equally distributed - you could bitreverse the hash for batch selection, > > and get that, no? > > Right, that's what I said earlier: > > > batchno = reverse_bits(hashvalue) & (nbatch - 1); > > bucketno = hashvalue & (nbuckets - 1); > > My theory is that batchno = hash_combine(<constant>, hashvalue) & > (nbatch - 1) has about the same distribution properties. That is, it > works nicely up until you have more than about 2^32 keys, and then it > begins to break down due to lack of information, but the break down > manifests as more conflicts in buckets, instead of the failure to > repartition once you start reading off the end of the hash value (the > cause of the present bug report). The bit reverse approach might be a > bit less mysterious, but when I looked at ways to implement that they > looked potentially slower than the xor, shifts and adds used by > hash_combine. I didn't attempt to test that, though. If you have plenty of disk space and time, you can repro the problem like so:

    create unlogged table sevenb as select generate_series(1, 7000000000)::bigint i;
    analyze sevenb;
    set work_mem = '150MB';
    explain analyze select count(*) from sevenb t1 join sevenb t2 using (i);

If you have max_parallel_workers_per_gather >= 6 it gives up the strange idea of using a sort-merge join, otherwise you also need enable_mergejoin = off to prevent that, which is curious. You can see as soon as it starts executing the join that it's only writing to 1023 partitions (that is, 1023 * nprocesses temporary files with names like "i...of4096.p...", and then when one of those gets too big, "i...of8192.p...", and so on until something goes wrong). My (slow) machine hasn't finished repartitioning yet... I'll check again tomorrow, if it hasn't melted... Patched, it writes to 4095 as it should (there is no "i0of4096..."), their sizes are uniform, and the query completes after a single repartitioning operation. (Well it's 4096 with work_mem = 150MB and 2 workers, YMMV). I'm planning to push something like this late this week (though I think the constant might be better as 0 since it doesn't matter much given the definition, and I need a back-patchable version that is open coded or I need to back-patch hash_combine), unless someone has a better idea. Separately, I'm working on an idea for how to support extended hashing for the master branch only.
On Wed, Nov 20, 2019 at 1:05 PM Thomas Munro <thomas.munro@gmail.com> wrote: > I'm planning to push something like this late this week (though I > think the constant might be better as 0 since it doesn't matter much > given the definition, and I need a back-patchable version that is open > coded or I need to back-patch hash_combine), unless someone has a > better idea. Scratch that plan, it turned out to be a terrible idea. Although it solves the batch problem, it's much worse than I expected at bucket distribution, so the performance is bad. I think something like that might work given better bit mixing, but that sounds expensive (or maybe I'm just being really stupid here). Anyway, back to the drawing board. I'm now thinking the bit stealing scheme (ie take bits from the other end of the hash, and just let them start to overlap with bucket bits if you have to when nbatch increases) is a better plan, and it is so much easier to reason about, and still in the realm of a handful of instructions. Here are some numbers [1] showing the number of non-empty buckets ("pop") and chain length distribution for non-empty buckets, given 7 billion bigints as input.

     method | nbatch | nbucket |  pop   | min  | max  | avg  | stddev
    --------+--------+---------+--------+------+------+------+--------
     rehash |   4096 | 1048576 |    256 | 6212 | 7001 | 6674 |    110
     steal  |   4096 | 1048576 | 659574 |    1 |   17 |  2.5 |    1.5
     steal  |   4096 | 4194304 | 659574 |    1 |   17 |  2.5 |    1.5
     steal  |   8192 | 4194304 | 329685 |    1 |   17 |  2.5 |    1.5

The rehash numbers are obviously catastrophic in this case, and that's before we've even run out of hash bits. The steal numbers show that you just waste memory on empty buckets when hash bits get thin on the ground (the last two lines in the table). We could perhaps consider automatically lowering nbucket in this case too, since you can't really have more than 2^32 virtual buckets, so pretending you can just wastes array memory. So here's a draft steal-the-bucket-bits patch. Yeah, reverse_bits_u32() may be in the wrong header. But is it too slow? On my desktop I can call ExecGetBucketAndBatch() 353 million times per second (~2.8ns), and unpatched gets me 656 million/sec (~1.5ns) (though that includes a function call, and when Hash does it it's inlined), but it's only slower when nbatch > 1 due to the condition. To put that number into perspective, I can hash 10 million single-int tuples from a prewarmed seq scan in 2.5s without batching or parallelism, so that's 250ns per tuple. This'd be +0.4% of that, and I do see it in a few more samples with my profiler, but it's still basically nothing, and lost in the noise with other noisy partitioning overheads like IO. Thoughts? [1] Working: https://gist.github.com/macdice/b7eb7015846b8384972c2c7cfd6d2c22
Attachment
Hello, I can make no useful comment on the correctness of the new bit distribution. But I can make two comments on this part: On 2019-Nov-21, Thomas Munro wrote: > So here's a draft steal-the-bucket-bits patch. Yeah, > reverse_bits_u32() may be in the wrong header. I think pg_bitutils.h is the right place in master, but that file didn't exist earlier. I think access/hash.h is the right place in older branches, which is where hash_any() used to be. > But is it too slow? > On my desktop I can call ExecGetBucketAndBatch() 353 million times per > second (~2.8ns), and unpatched gets me 656 million/sec (~1.5ns) > (though that includes a function call, and when Hash does it it's > inlined), but it's only slower when nbatch > 1 due to the condition. If the whole query takes hours (or even minutes) to run, then adding one second of runtime is not going to change things in any noticeable way. I'd rather have the extreme cases working and take an additional 1.3ns per output row than not work at all. > To put that number into persepective, I can hash 10 million single-int > tuples from a prewarmed seq scan in 2.5s without batching or > parallelism, so that's 250ns per tuple. This'd be +0.4% of that, and > I do see it in a few more samples with my profiler, but it's still > basically nothing, and lost in the noise with other noisy partitioning > overheads like IO. Thoughts? That summarizes it well for me: yes, it's a slowdown, yes it's barely measurable. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Dec 03, 2019 at 02:07:04PM -0300, Alvaro Herrera wrote: >Hello > >I can make no useful comment on the correctness of the new bit >distribution. But I can make two comments on this part: > >On 2019-Nov-21, Thomas Munro wrote: > >> So here's a draft steal-the-bucket-bits patch. Yeah, >> reverse_bits_u32() may be in the wrong header. > >I think pg_bitutils.h is the right place in master, but that file didn't >exist earlier. I think access/hash.h is the right place in older >branches, which is where hash_any() used to be. > >> But is it too slow? >> On my desktop I can call ExecGetBucketAndBatch() 353 million times per >> second (~2.8ns), and unpatched gets me 656 million/sec (~1.5ns) >> (though that includes a function call, and when Hash does it it's >> inlined), but it's only slower when nbatch > 1 due to the condition. > >If the whole query takes hours (or even minutes) to run, then adding one >second of runtime is not going to change things in any noticeable way. >I'd rather have the extreme cases working and take additional 1.3ns per >output row, than not work at all. > >> To put that number into persepective, I can hash 10 million single-int >> tuples from a prewarmed seq scan in 2.5s without batching or >> parallelism, so that's 250ns per tuple. This'd be +0.4% of that, and >> I do see it in a few more samples with my profiler, but it's still >> basically nothing, and lost in the noise with other noisy partitioning >> overheads like IO. Thoughts? > >That summarizes it well for me: yes, it's a slowdown, yes it's barely >measurable.

OK, I did a bit of testing with this patch today, and I can confirm it does indeed help quite a bit. I only used a smaller data set with 4.5B rows, but it was enough to trigger the issue - it allocated ~1TB of temp space, and then crashed. The annoying thing is that it's the workers that crash, and the leader failed to notice that, so it was waiting in WaitForParallelWorkersToExit forever. Not sure what the issue is. Anyway, with the patch applied, the query completed in ~2.2 hours, after allocating ~235GB of temp files. So that's good, the patch clearly improves the situation quite a bit. As for the performance impact, I did this:

    create table dim (id int, val text);
    insert into dim select i, md5(i::text) from generate_series(1,1000000) s(i);
    create table fact (id int, val text);
    insert into fact select mod(i,1000000)+1, md5(i::text) from generate_series(1,25000000) s(i);
    set max_parallel_workers_per_gather = 0;
    select count(*) from fact join dim using (id);

So a perfectly regular join between a 1M and a 25M table. On my machine, this takes ~8851ms on master and 8979ms with the patch (average of about 20 runs with minimal variability). That's a ~1.4% regression, so a bit more than the 0.4% mentioned before. Not a huge difference though, and some of it might be due to different binary layout etc. It's probably still a good idea to do this, although I wonder if we might start doing this only when actually running out of bits (and use the current algorithm by default). But I have no idea if that would be any cheaper, and it would add complexity. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2019-Dec-10, Tomas Vondra wrote: > It's probably still a good idea to do this, although I wonder if we > might start doing this only when actually running out of bits (and use > the current algorithm by default). But I have no idea if that would be > any cheaper, and it would add complexity. I'm under the impression that this patch is proposed for backpatching, and that in master we can simply increase the hash width. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Dec 10, 2019 at 02:35:16PM -0300, Alvaro Herrera wrote: >On 2019-Dec-10, Tomas Vondra wrote: > >> It's probably still a good idea to do this, although I wonder if we >> might start doing this only when actually running out of bits (and use >> the current algorithm by default). But I have no idea if that would be >> any cheaper, and it would add complexity. > >I'm under the impression that this patch is proposed for backpatching, >and that in master we can simply increase the hash width. > Possibly, although AFAIK we don't have that patch yet, and I don't recall any measurements (so it might have overhead too, not sure). But it's true additional complexity might complicate backpatching. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > As for the performance impact, I did this: > create table dim (id int, val text); > insert into dim select i, md5(i::text) from generate_series(1,1000000) s(i); > create table fact (id int, val text); > insert into fact select mod(i,1000000)+1, md5(i::text) from generate_series(1,25000000) s(i); > set max_parallel_workers_per_gather = 0; > select count(*) from fact join dim using (id); > So a perfectly regular join between 1M and 25M table. On my machine, > this takes ~8851ms on master and 8979ms with the patch (average of about > 20 runs with minimal variability). That's ~1.4% regression, so a bit > more than the 0.4% mentioned before. Not a huge difference though, and > some of it might be due to different binary layout etc. Hmm ... I replicated this experiment here, using my usual precautions to get more-or-less-reproducible numbers [1]. I concur that the patch seems to be slower, but only by around half a percent on the median numbers, which is much less than the run-to-run variation. So that would be fine --- except that in my first set of runs, I forgot the "set max_parallel_workers_per_gather" step and hence tested this same data set with a parallel hash join. And in that scenario, I got a repeatable slowdown of around 7.5%, which is far above the noise floor. So that's not good --- why does this change make PHJ worse? regards, tom lane [1] https://www.postgresql.org/message-id/31686.1574722301%40sss.pgh.pa.us
On Tue, Dec 10, 2019 at 02:49:42PM -0500, Tom Lane wrote: >Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: >> As for the performance impact, I did this: > >> create table dim (id int, val text); >> insert into dim select i, md5(i::text) from generate_series(1,1000000) s(i); >> create table fact (id int, val text); >> insert into fact select mod(i,1000000)+1, md5(i::text) from generate_series(1,25000000) s(i); >> set max_parallel_workers_per_gather = 0; >> select count(*) from fact join dim using (id); > >> So a perfectly regular join between 1M and 25M table. On my machine, >> this takes ~8851ms on master and 8979ms with the patch (average of about >> 20 runs with minimal variability). That's ~1.4% regression, so a bit >> more than the 0.4% mentioned before. Not a huge difference though, and >> some of it might be due to different binary layout etc. > >Hmm ... I replicated this experiment here, using my usual precautions >to get more-or-less-reproducible numbers [1]. I concur that the >patch seems to be slower, but only by around half a percent on the >median numbers, which is much less than the run-to-run variation. > Sounds good. >So that would be fine --- except that in my first set of runs, >I forgot the "set max_parallel_workers_per_gather" step and hence >tested this same data set with a parallel hash join. And in that >scenario, I got a repeatable slowdown of around 7.5%, which is far >above the noise floor. So that's not good --- why does this change >make PHJ worse? > Hmmm, I can't reproduce this. For me the timings from 20 runs look like this:

           master         |         patched
   workers=2   workers=5  |  workers=2   workers=5
  ------------------------|--------------------------
        3153        1785  |       3185        1790
        3167        1783  |       3220        1783

I haven't done the extra steps with cpupower/taskset, but the numbers seem pretty consistent. I'll try on another machine tomorrow. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Dec 11, 2019 at 7:36 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On Tue, Dec 10, 2019 at 02:35:16PM -0300, Alvaro Herrera wrote: > >On 2019-Dec-10, Tomas Vondra wrote: > >> It's probably still a good idea to do this, although I wonder if we > >> might start doing this only when actually running out of bits (and use > >> the current algorithm by default). But I have no idea if that would be > >> any cheaper, and it would add complexity. Yeah, I understand your desire to minimise the change here. One thing we could do to get exactly the same algorithm as today right up until we run out of bits and then begin stealing bucket bits would be to rotate instead of shifting:

- *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
+ *batchno = rotate(hashvalue, hashtable->log2_nbuckets) & (nbatch - 1);

Attached. Thoughts? At a glance, the definition of rotate() that I used seems to be recognised by my compiler, so that the generated code differs only by a SHR instruction changing to a ROR instruction on my system. The bit-reversing thing has one nice advantage: you are allowed to increase nbuckets later. But then I thought of another way to achieve that: take batchno from the low end, and bucketno from the high end. I'm not pursuing that, based on the suspicion that what we all want here is the minimal possible change. > >I'm under the impression that this patch is proposed for backpatching, > >and that in master we can simply increase the hash width. > > Possibly, although AFAIK we don't have that patch yet, and I don't > recall any measurements (so it might have overhead too, not sure). But > it's true additional complexity might complicate backpatching. What I'd like to experiment with in the new year is a code specialisation scheme that further develops what I did with ExecHashJoinImpl(). This would increase the executable size by including a bunch of different variants of the hash join code, but avoid adding new branches (and also remove some existing branches) in the hot paths. We'd still write one "template" version of the algorithm that contains if statements for the points of variation, and then we'd call it from a load of wrapper functions with constant arguments and forced inlining. For example, there might be a variant for { non-parallel, 64 bit hashes, no skew optimisation }. A more extreme version of the idea inlines the actual hashing functions for a handful of common cases like text and integer keys, but that gets into combinatorial explosion territory among other problems. I have some early prototype code along these lines but it's not good enough yet; more soon. I should probably think about Andres's proposal to merge Hash and Hash Join nodes first.
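To make the one-line difference concrete, here is a minimal, compilable sketch of both variants; rotate_right32(), the table sizes and the hash value are illustrative stand-ins (the real change is in the attached patch), but the two batchno expressions mirror the diff above:

#include <stdint.h>
#include <stdio.h>

/* Rotate right by n bits; most compilers recognise this pattern and emit a
 * single ROR instruction.  Assumes 0 < n < 32. */
static inline uint32_t
rotate_right32(uint32_t x, int n)
{
    return (x >> n) | (x << (32 - n));
}

int
main(void)
{
    uint32_t hashvalue = 0xDEADBEEFu;     /* arbitrary example hash */
    int      log2_nbuckets = 22;          /* 4M buckets */
    uint32_t nbatch = 1U << 12;           /* 4096 batches: 22 + 12 > 32 bits */

    /* Unpatched "shift": once the high bits are exhausted, the upper batch
     * bits are always zero, so batch numbers above 2^10 can never be chosen. */
    uint32_t batch_shift = (hashvalue >> log2_nbuckets) & (nbatch - 1);

    /* Patched "rotate": identical to the shift until the high bits run out,
     * then it steals the lowest bucket bits instead of reading zeroes. */
    uint32_t batch_rotate = rotate_right32(hashvalue, log2_nbuckets) & (nbatch - 1);

    printf("shift=%u rotate=%u\n", (unsigned) batch_shift, (unsigned) batch_rotate);
    return 0;
}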
On Wed, Dec 11, 2019 at 6:32 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > The annoying thing is that it's the workers that crash, and the leader > failed to notice that, so it was waiting in WaitForParallelWorkersToExit > forever. Not sure what the issue is. That's bad. Some edge case in WaitForBackgroundWorkerShutdown()/GetBackgroundWorkerPid() that is b0rked perhaps? Can you reproduce it, and tell us what the stack looks like, maybe on a new thread?
On Sat, Dec 14, 2019 at 05:01:05PM +1300, Thomas Munro wrote: >On Wed, Dec 11, 2019 at 6:32 AM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> The annoying thing is that it's the workers that crash, and the leader >> failed to notice that, so it was waiting in WaitForParallelWorkersToExit >> forever. Not sure what the issue is. > >That's bad. Some edge case in >WaitForBackgroundWorkerShutdown()/GetBackgroundWorkerPid() that is >b0rked perhaps? Can you reproduce it, and tell us what the stack >looks like, maybe on a new thread? Sure. I'll try reproducing it and get this extra info. I was planning to do that, but then I lost the data for the report because of a silly mistake. Early next week, hopefully. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Dec 14, 2019 at 02:39:12PM +0100, Tomas Vondra wrote: >On Sat, Dec 14, 2019 at 05:01:05PM +1300, Thomas Munro wrote: >>On Wed, Dec 11, 2019 at 6:32 AM Tomas Vondra >><tomas.vondra@2ndquadrant.com> wrote: >>>The annoying thing is that it's the workers that crash, and the leader >>>failed to notice that, so it was waiting in WaitForParallelWorkersToExit >>>forever. Not sure what the issue is. >> >>That's bad. Some edge case in >>WaitForBackgroundWorkerShutdown()/GetBackgroundWorkerPid() that is >>b0rked perhaps? Can you reproduce it, and tell us what the stack >>looks like, maybe on a new thread? > >Sure. I'll try reproducing it and get this extra info. I was planning to >do that, but then I lost the data for the report because of a silly >mistake. Early next week, hopefully. > I've tried reproducing this, but so far no luck. I've only managed to do two runs, though, because this particular machine only has space for the 1TB of temp files on spinning rust. I wonder if it's possible to reproduce the issue with a smaller data set? I've tried using smaller work_mem, but then it does not trigger the issue :-( regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Dec 16, 2019 at 03:09:22PM +0100, Tomas Vondra wrote: >On Sat, Dec 14, 2019 at 02:39:12PM +0100, Tomas Vondra wrote: >>On Sat, Dec 14, 2019 at 05:01:05PM +1300, Thomas Munro wrote: >>>On Wed, Dec 11, 2019 at 6:32 AM Tomas Vondra >>><tomas.vondra@2ndquadrant.com> wrote: >>>>The annoying thing is that it's the workers that crash, and the leader >>>>failed to notice that, so it was waiting in WaitForParallelWorkersToExit >>>>forever. Not sure what the issue is. >>> >>>That's bad. Some edge case in >>>WaitForBackgroundWorkerShutdown()/GetBackgroundWorkerPid() that is >>>b0rked perhaps? Can you reproduce it, and tell us what the stack >>>looks like, maybe on a new thread? >> >>Sure. I'll try reproducing it and get this extra info. I was planning to >>do that, but then I lost the data for the report because of a silly >>mistake. Early next week, hopefully. >> > >I've tried reproducing this, but so far no luck. I've only managed to do >two runs, though, because this particular machine only has space for the >1TB of temp files on spinning rust. I wonder if it's possible to >reproduce the issue with a smaller data set? I've tried using smaller >work_mem, but then it does not trigger the issue :-( > And of course, the very next run after I send the "can't reproduce" message actually reproduces the issue. How typical ... So, there were ~8 workers while the query was running, with the leader process being 29417. At the moment, the list of processes looks like this: bench2 ~ # ps ax | grep postgres 11877 pts/2 S 0:00 su - postgres 14082 ? Ss 12:38 postgres: parallel worker for PID 29417 14083 ? Ss 12:28 postgres: parallel worker for PID 29417 17334 pts/0 S+ 0:00 grep --colour=auto postgres 29408 ? Ss 0:01 /var/lib/postgresql/pg-master/bin/postgres -D /mnt/data/data-master 29410 ? Ss 0:00 postgres: checkpointer 29411 ? Ss 0:01 postgres: background writer 29412 ? Ss 0:01 postgres: walwriter 29413 ? Ss 0:02 postgres: autovacuum launcher 29414 ? Ss 0:14 postgres: stats collector 29415 ? Ss 0:00 postgres: logical replication launcher 29417 ? 
Ss 75:11 postgres: postgres test [local] EXPLAIN and it failes like this: 2019-12-16 20:25:00.617 CET [14081] ERROR: invalid DSA memory alloc request size 1711276032 2019-12-16 20:25:00.617 CET [14081] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i); 2019-12-16 20:25:00.617 CET [29417] ERROR: invalid DSA memory alloc request size 1711276032 2019-12-16 20:25:00.617 CET [29417] CONTEXT: parallel worker 2019-12-16 20:25:00.617 CET [29417] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i); 2019-12-16 20:25:00.618 CET [14080] FATAL: terminating connection due to administrator command 2019-12-16 20:25:00.618 CET [14080] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i); 2019-12-16 20:25:00.618 CET [14085] FATAL: terminating connection due to administrator command 2019-12-16 20:25:00.618 CET [14085] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i); 2019-12-16 20:25:00.618 CET [14084] FATAL: terminating connection due to administrator command 2019-12-16 20:25:00.618 CET [14084] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i); 2019-12-16 20:25:00.618 CET [14079] FATAL: terminating connection due to administrator command 2019-12-16 20:25:00.618 CET [14079] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i); 2019-12-16 20:25:00.618 CET [14078] FATAL: terminating connection due to administrator command 2019-12-16 20:25:00.618 CET [14078] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i); 2019-12-16 20:25:00.746 CET [29408] LOG: background worker "parallel worker" (PID 14081) exited with exit code 1 2019-12-16 20:25:00.816 CET [29408] LOG: background worker "parallel worker" (PID 14079) exited with exit code 1 2019-12-16 20:25:00.816 CET [29408] LOG: background worker "parallel worker" (PID 14085) exited with exit code 1 2019-12-16 20:25:00.817 CET [29408] LOG: background worker "parallel worker" (PID 14084) exited with exit code 1 2019-12-16 20:25:00.817 CET [29408] LOG: background worker "parallel worker" (PID 14080) exited with exit code 1 2019-12-16 20:25:00.818 CET [29408] LOG: background worker "parallel worker" (PID 14078) exited with exit code 1 which is ~2h ago. There's no activity at all (no I/O or anything like that), the temp directory is ~1TB of allocated files: bench2 /var/lib/postgresql # du -s /mnt/raid/pgsql_tmp/ 1099760800 /mnt/raid/pgsql_tmp/ which means nothing got cleaned up. The backtraces for the remaining backends look like this: 1) 29417 (i.e. 
the leader) 0x00007f64684d2bfb in epoll_wait (epfd=5, events=0x55d887416dc0, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 30 return SYSCALL_CANCEL (epoll_wait, epfd, events, maxevents, timeout); (gdb) bt #0 0x00007f64684d2bfb in epoll_wait (epfd=5, events=0x55d887416dc0, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 #1 0x000055d8864d4e3c in WaitEventSetWaitBlock (set=0x55d887416d48, cur_timeout=-1, occurred_events=0x7ffc6e805c70, nevents=1)at latch.c:1080 #2 0x000055d8864d4d00 in WaitEventSetWait (set=0x55d887416d48, timeout=-1, occurred_events=0x7ffc6e805c70, nevents=1, wait_event_info=134217728)at latch.c:1032 #3 0x000055d8864d4565 in WaitLatchOrSocket (latch=0x7f64670ac194, wakeEvents=17, sock=-1, timeout=-1, wait_event_info=134217728)at latch.c:407 #4 0x000055d8864d4421 in WaitLatch (latch=0x7f64670ac194, wakeEvents=17, timeout=0, wait_event_info=134217728) at latch.c:347 #5 0x000055d88644c18b in WaitForBackgroundWorkerShutdown (handle=0x55d88742abc8) at bgworker.c:1153 #6 0x000055d886147690 in WaitForParallelWorkersToExit (pcxt=0x55d88742a338) at parallel.c:842 #7 0x000055d8861478af in DestroyParallelContext (pcxt=0x55d88742a338) at parallel.c:924 #8 0x000055d886147feb in AtEOXact_Parallel (isCommit=false) at parallel.c:1197 #9 0x000055d886153774 in AbortTransaction () at xact.c:2654 #10 0x000055d886153f80 in AbortCurrentTransaction () at xact.c:3196 #11 0x000055d8865044e7 in PostgresMain (argc=1, argv=0x55d8873d41e8, dbname=0x55d8873d4118 "test", username=0x55d8873d40f8"postgres") at postgres.c:4081 #12 0x000055d88645f0d6 in BackendRun (port=0x55d8873ca150) at postmaster.c:4498 #13 0x000055d88645e7fc in BackendStartup (port=0x55d8873ca150) at postmaster.c:4189 #14 0x000055d88645abd0 in ServerLoop () at postmaster.c:1727 #15 0x000055d88645a366 in PostmasterMain (argc=3, argv=0x55d8873a3f70) at postmaster.c:1400 #16 0x000055d88636fab5 in main (argc=3, argv=0x55d8873a3f70) at main.c:210 2) 14082 0x00007f64684d2bfb in epoll_wait (epfd=6, events=0x55d8873d53d8, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 30 return SYSCALL_CANCEL (epoll_wait, epfd, events, maxevents, timeout); (gdb) bt #0 0x00007f64684d2bfb in epoll_wait (epfd=6, events=0x55d8873d53d8, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 #1 0x000055d8864d4e3c in WaitEventSetWaitBlock (set=0x55d8873d5378, cur_timeout=-1, occurred_events=0x7ffc6e805130, nevents=1)at latch.c:1080 #2 0x000055d8864d4d00 in WaitEventSetWait (set=0x55d8873d5378, timeout=-1, occurred_events=0x7ffc6e805130, nevents=1, wait_event_info=134217742)at latch.c:1032 #3 0x000055d8864e1848 in ConditionVariableTimedSleep (cv=0x7f63602a060c, timeout=-1, wait_event_info=134217742) at condition_variable.c:186 #4 0x000055d8864e17b2 in ConditionVariableSleep (cv=0x7f63602a060c, wait_event_info=134217742) at condition_variable.c:126 #5 0x000055d8864d05f2 in BarrierArriveAndWait (barrier=0x7f63602a05f4, wait_event_info=134217742) at barrier.c:191 #6 0x000055d886328d4e in ExecParallelHashIncreaseNumBatches (hashtable=0x55d887481e70) at nodeHash.c:1189 #7 0x000055d88632ca1f in ExecParallelHashTuplePrealloc (hashtable=0x55d887481e70, batchno=237, size=40) at nodeHash.c:3308 #8 0x000055d886329dfe in ExecParallelHashTableInsert (hashtable=0x55d887481e70, slot=0x55d887470e10, hashvalue=994179508)at nodeHash.c:1725 #9 0x000055d8863275dc in MultiExecParallelHash (node=0x55d887470860) at nodeHash.c:288 #10 0x000055d8863271e3 in MultiExecHash (node=0x55d887470860) at 
nodeHash.c:111 #11 0x000055d88630b79c in MultiExecProcNode (node=0x55d887470860) at execProcnode.c:500 #12 0x000055d88632cfac in ExecHashJoinImpl (pstate=0x55d8874700a0, parallel=true) at nodeHashjoin.c:291 #13 0x000055d88632d617 in ExecParallelHashJoin (pstate=0x55d8874700a0) at nodeHashjoin.c:588 #14 0x000055d88630b6e9 in ExecProcNodeInstr (node=0x55d8874700a0) at execProcnode.c:460 #15 0x000055d88630b6ba in ExecProcNodeFirst (node=0x55d8874700a0) at execProcnode.c:444 #16 0x000055d886318a14 in ExecProcNode (node=0x55d8874700a0) at ../../../src/include/executor/executor.h:240 #17 0x000055d886318e8b in fetch_input_tuple (aggstate=0x55d88746f628) at nodeAgg.c:405 #18 0x000055d88631b2fa in agg_retrieve_direct (aggstate=0x55d88746f628) at nodeAgg.c:1757 #19 0x000055d88631af4a in ExecAgg (pstate=0x55d88746f628) at nodeAgg.c:1572 #20 0x000055d88630b6e9 in ExecProcNodeInstr (node=0x55d88746f628) at execProcnode.c:460 #21 0x000055d88630b6ba in ExecProcNodeFirst (node=0x55d88746f628) at execProcnode.c:444 #22 0x000055d886300882 in ExecProcNode (node=0x55d88746f628) at ../../../src/include/executor/executor.h:240 #23 0x000055d886303055 in ExecutePlan (estate=0x55d88746f398, planstate=0x55d88746f628, use_parallel_mode=false, operation=CMD_SELECT,sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0x55d88741d408, execute_once=true)at execMain.c:1646 #24 0x000055d886300d9a in standard_ExecutorRun (queryDesc=0x55d88741e9e0, direction=ForwardScanDirection, count=0, execute_once=true)at execMain.c:364 #25 0x000055d886300c3c in ExecutorRun (queryDesc=0x55d88741e9e0, direction=ForwardScanDirection, count=0, execute_once=true)at execMain.c:308 #26 0x000055d886307c2b in ParallelQueryMain (seg=0x55d8873d23d8, toc=0x7f63602a0000) at execParallel.c:1401 #27 0x000055d886148632 in ParallelWorkerMain (main_arg=2143010421) at parallel.c:1430 #28 0x000055d88644ba56 in StartBackgroundWorker () at bgworker.c:834 #29 0x000055d88645ff82 in do_start_bgworker (rw=0x55d8873ca340) at postmaster.c:5839 #30 0x000055d88646038f in maybe_start_bgworkers () at postmaster.c:6065 #31 0x000055d88645f265 in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5234 #32 <signal handler called> #33 0x00007f64684c956b in __GI___select (nfds=7, readfds=0x7ffc6e8060e0, writefds=0x0, exceptfds=0x0, timeout=0x7ffc6e806050)at ../sysdeps/unix/sysv/linux/select.c:41 #34 0x000055d88645aa3c in ServerLoop () at postmaster.c:1691 #35 0x000055d88645a366 in PostmasterMain (argc=3, argv=0x55d8873a3f70) at postmaster.c:1400 #36 0x000055d88636fab5 in main (argc=3, argv=0x55d8873a3f70) at main.c:210 3) 14083 0x00007f64684d2bfb in epoll_wait (epfd=6, events=0x55d8873d53d8, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 30 return SYSCALL_CANCEL (epoll_wait, epfd, events, maxevents, timeout); (gdb) bt #0 0x00007f64684d2bfb in epoll_wait (epfd=6, events=0x55d8873d53d8, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 #1 0x000055d8864d4e3c in WaitEventSetWaitBlock (set=0x55d8873d5378, cur_timeout=-1, occurred_events=0x7ffc6e805120, nevents=1)at latch.c:1080 #2 0x000055d8864d4d00 in WaitEventSetWait (set=0x55d8873d5378, timeout=-1, occurred_events=0x7ffc6e805120, nevents=1, wait_event_info=134217742)at latch.c:1032 #3 0x000055d8864e1848 in ConditionVariableTimedSleep (cv=0x7f63602a060c, timeout=-1, wait_event_info=134217742) at condition_variable.c:186 #4 0x000055d8864e17b2 in ConditionVariableSleep (cv=0x7f63602a060c, wait_event_info=134217742) at condition_variable.c:126 #5 
0x000055d8864d05f2 in BarrierArriveAndWait (barrier=0x7f63602a05f4, wait_event_info=134217742) at barrier.c:191 #6 0x000055d886328d4e in ExecParallelHashIncreaseNumBatches (hashtable=0x55d887481e70) at nodeHash.c:1189 #7 0x000055d88632b93c in ExecParallelHashTupleAlloc (hashtable=0x55d887481e70, size=40, shared=0x7ffc6e8052b8) at nodeHash.c:2805 #8 0x000055d886329d46 in ExecParallelHashTableInsert (hashtable=0x55d887481e70, slot=0x55d887470e10, hashvalue=2006491)at nodeHash.c:1702 #9 0x000055d8863275dc in MultiExecParallelHash (node=0x55d887470860) at nodeHash.c:288 #10 0x000055d8863271e3 in MultiExecHash (node=0x55d887470860) at nodeHash.c:111 #11 0x000055d88630b79c in MultiExecProcNode (node=0x55d887470860) at execProcnode.c:500 #12 0x000055d88632cfac in ExecHashJoinImpl (pstate=0x55d8874700a0, parallel=true) at nodeHashjoin.c:291 #13 0x000055d88632d617 in ExecParallelHashJoin (pstate=0x55d8874700a0) at nodeHashjoin.c:588 #14 0x000055d88630b6e9 in ExecProcNodeInstr (node=0x55d8874700a0) at execProcnode.c:460 #15 0x000055d88630b6ba in ExecProcNodeFirst (node=0x55d8874700a0) at execProcnode.c:444 #16 0x000055d886318a14 in ExecProcNode (node=0x55d8874700a0) at ../../../src/include/executor/executor.h:240 #17 0x000055d886318e8b in fetch_input_tuple (aggstate=0x55d88746f628) at nodeAgg.c:405 #18 0x000055d88631b2fa in agg_retrieve_direct (aggstate=0x55d88746f628) at nodeAgg.c:1757 #19 0x000055d88631af4a in ExecAgg (pstate=0x55d88746f628) at nodeAgg.c:1572 #20 0x000055d88630b6e9 in ExecProcNodeInstr (node=0x55d88746f628) at execProcnode.c:460 #21 0x000055d88630b6ba in ExecProcNodeFirst (node=0x55d88746f628) at execProcnode.c:444 #22 0x000055d886300882 in ExecProcNode (node=0x55d88746f628) at ../../../src/include/executor/executor.h:240 #23 0x000055d886303055 in ExecutePlan (estate=0x55d88746f398, planstate=0x55d88746f628, use_parallel_mode=false, operation=CMD_SELECT,sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0x55d88741d408, execute_once=true)at execMain.c:1646 #24 0x000055d886300d9a in standard_ExecutorRun (queryDesc=0x55d88741e9e0, direction=ForwardScanDirection, count=0, execute_once=true)at execMain.c:364 #25 0x000055d886300c3c in ExecutorRun (queryDesc=0x55d88741e9e0, direction=ForwardScanDirection, count=0, execute_once=true)at execMain.c:308 #26 0x000055d886307c2b in ParallelQueryMain (seg=0x55d8873d23d8, toc=0x7f63602a0000) at execParallel.c:1401 #27 0x000055d886148632 in ParallelWorkerMain (main_arg=2143010421) at parallel.c:1430 #28 0x000055d88644ba56 in StartBackgroundWorker () at bgworker.c:834 #29 0x000055d88645ff82 in do_start_bgworker (rw=0x55d8873ca590) at postmaster.c:5839 #30 0x000055d88646038f in maybe_start_bgworkers () at postmaster.c:6065 #31 0x000055d88645f265 in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5234 #32 <signal handler called> #33 0x00007f64684c956b in __GI___select (nfds=7, readfds=0x7ffc6e8060e0, writefds=0x0, exceptfds=0x0, timeout=0x7ffc6e806050)at ../sysdeps/unix/sysv/linux/select.c:41 #34 0x000055d88645aa3c in ServerLoop () at postmaster.c:1691 #35 0x000055d88645a366 in PostmasterMain (argc=3, argv=0x55d8873a3f70) at postmaster.c:1400 #36 0x000055d88636fab5 in main (argc=3, argv=0x55d8873a3f70) at main.c:210 So the only difference between the two remaining worker processes is frame #7, where 14082 calls ExecParallelHashTuplePrealloc while 14083 calls ExecParallelHashTupleAlloc. I'll leave the system in this "stuck" space in case we need to do further investigation. 
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Dec 17, 2019 at 10:02 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On Mon, Dec 16, 2019 at 03:09:22PM +0100, Tomas Vondra wrote: > >On Sat, Dec 14, 2019 at 02:39:12PM +0100, Tomas Vondra wrote: > >>On Sat, Dec 14, 2019 at 05:01:05PM +1300, Thomas Munro wrote: > >>>On Wed, Dec 11, 2019 at 6:32 AM Tomas Vondra > >>><tomas.vondra@2ndquadrant.com> wrote: > >>>>The annoying thing is that it's the workers that crash, and the leader > >>>>failed to notice that, so it was waiting in WaitForParallelWorkersToExit > >>>>forever. Not sure what the issue is. > >>> > >>>That's bad. Some edge case in > >>>WaitForBackgroundWorkerShutdown()/GetBackgroundWorkerPid() that is > >>>b0rked perhaps? Can you reproduce it, and tell us what the stack > >>>looks like, maybe on a new thread? > >> > >>Sure. I'll try reproducing it and get this extra info. I was planning to > >>do that, but then I lost the data for the report because of a silly > >>mistake. Early next week, hopefully. > >> > > > >I've tried reproducing this, but so far no luck. I've only managed to do > >two runs, though, because this particular machine only has space for the > >1TB of temp files on spinning rust. I wonder if it's possible to > >reproduce the issue with a smaller data set? I've tried using smaller > >work_mem, but then it does not trigger the issue :-( > > > > And of course, the very next run after I send the "can't reproduce" > message actually reproduces the issue. How typical ... Thanks for reproducing this. > 2019-12-16 20:25:00.617 CET [14081] ERROR: invalid DSA memory alloc request size 1711276032 > 2019-12-16 20:25:00.617 CET [14081] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i); > 2019-12-16 20:25:00.617 CET [29417] ERROR: invalid DSA memory alloc request size 1711276032 > 2019-12-16 20:25:00.617 CET [29417] CONTEXT: parallel worker Ok, so 14081 raised an error and exited, ... > 2019-12-16 20:25:00.746 CET [29408] LOG: background worker "parallel worker" (PID 14081) exited with exit code 1 ... and then the leader cancelled the query on 5 other workers, > 2019-12-16 20:25:00.816 CET [29408] LOG: background worker "parallel worker" (PID 14079) exited with exit code 1 > 2019-12-16 20:25:00.816 CET [29408] LOG: background worker "parallel worker" (PID 14085) exited with exit code 1 > 2019-12-16 20:25:00.817 CET [29408] LOG: background worker "parallel worker" (PID 14084) exited with exit code 1 > 2019-12-16 20:25:00.817 CET [29408] LOG: background worker "parallel worker" (PID 14080) exited with exit code 1 > 2019-12-16 20:25:00.818 CET [29408] LOG: background worker "parallel worker" (PID 14078) exited with exit code 1 ... 
but somehow there were 2 more workers that didn't get the message and are waiting:

> #0 0x00007f64684d2bfb in epoll_wait (epfd=6, events=0x55d8873d53d8, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
> #1 0x000055d8864d4e3c in WaitEventSetWaitBlock (set=0x55d8873d5378, cur_timeout=-1, occurred_events=0x7ffc6e805130, nevents=1) at latch.c:1080
> #2 0x000055d8864d4d00 in WaitEventSetWait (set=0x55d8873d5378, timeout=-1, occurred_events=0x7ffc6e805130, nevents=1, wait_event_info=134217742) at latch.c:1032
> #3 0x000055d8864e1848 in ConditionVariableTimedSleep (cv=0x7f63602a060c, timeout=-1, wait_event_info=134217742) at condition_variable.c:186
> #4 0x000055d8864e17b2 in ConditionVariableSleep (cv=0x7f63602a060c, wait_event_info=134217742) at condition_variable.c:126
> #5 0x000055d8864d05f2 in BarrierArriveAndWait (barrier=0x7f63602a05f4, wait_event_info=134217742) at barrier.c:191
> #6 0x000055d886328d4e in ExecParallelHashIncreaseNumBatches (hashtable=0x55d887481e70) at nodeHash.c:1189

They're waiting on a condition variable, in a barrier, while trying to coordinate an increase in the number of batches, while hashing. I suspect the details of what exactly it's waiting for are not relevant here though. ConditionVariableTimedSleep() contains a standard latch wait loop with CHECK_FOR_INTERRUPTS(), and parallel workers run with the standard SIGTERM handler die(). So what I expect to happen here is:

1. DestroyParallelContext() calls TerminateBackgroundWorker() for all workers.
2. TerminateBackgroundWorker() sets slot->terminate, and signals the postmaster.
3. The postmaster sees the terminate flag and does kill(worker, SIGTERM).
4. The worker handles SIGTERM by setting ProcDiePending and setting its own latch.
5. The ConditionVariableTimedSleep()'s syscall returns, CHECK_FOR_INTERRUPTS() runs and sees ProcDiePending.

So how did it escape this fate? Ooooh. Now I'm wondering if the problem is that ConditionVariablePrepareToSleep() contains a call to ResetLatch(), but doesn't then run CHECK_FOR_INTERRUPTS(). Perhaps that ate the latch that was set by die(). There isn't another CHECK_FOR_INTERRUPTS() until *after* the WaitEventSetWait() in ConditionVariableTimedSleep().

> I'll leave the system in this "stuck" space in case we need to do further investigation.

What are the values of InterruptHoldoffCount, CritSectionCount, InterruptPending and ProcDiePending in the stuck workers?
On Wed, Dec 18, 2019 at 11:21:11AM +1300, Thomas Munro wrote: >On Tue, Dec 17, 2019 at 10:02 AM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> On Mon, Dec 16, 2019 at 03:09:22PM +0100, Tomas Vondra wrote: >> >On Sat, Dec 14, 2019 at 02:39:12PM +0100, Tomas Vondra wrote: >> >>On Sat, Dec 14, 2019 at 05:01:05PM +1300, Thomas Munro wrote: >> >>>On Wed, Dec 11, 2019 at 6:32 AM Tomas Vondra >> >>><tomas.vondra@2ndquadrant.com> wrote: >> >>>>The annoying thing is that it's the workers that crash, and the leader >> >>>>failed to notice that, so it was waiting in WaitForParallelWorkersToExit >> >>>>forever. Not sure what the issue is. >> >>> >> >>>That's bad. Some edge case in >> >>>WaitForBackgroundWorkerShutdown()/GetBackgroundWorkerPid() that is >> >>>b0rked perhaps? Can you reproduce it, and tell us what the stack >> >>>looks like, maybe on a new thread? >> >> >> >>Sure. I'll try reproducing it and get this extra info. I was planning to >> >>do that, but then I lost the data for the report because of a silly >> >>mistake. Early next week, hopefully. >> >> >> > >> >I've tried reproducing this, but so far no luck. I've only managed to do >> >two runs, though, because this particular machine only has space for the >> >1TB of temp files on spinning rust. I wonder if it's possible to >> >reproduce the issue with a smaller data set? I've tried using smaller >> >work_mem, but then it does not trigger the issue :-( >> > >> >> And of course, the very next run after I send the "can't reproduce" >> message actually reproduces the issue. How typical ... > >Thanks for reproducing this. > >> 2019-12-16 20:25:00.617 CET [14081] ERROR: invalid DSA memory alloc request size 1711276032 >> 2019-12-16 20:25:00.617 CET [14081] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i); >> 2019-12-16 20:25:00.617 CET [29417] ERROR: invalid DSA memory alloc request size 1711276032 >> 2019-12-16 20:25:00.617 CET [29417] CONTEXT: parallel worker > >Ok, so 14081 raised an error and exited, ... > >> 2019-12-16 20:25:00.746 CET [29408] LOG: background worker "parallel worker" (PID 14081) exited with exit code 1 > >... and then the leader cancelled the query on 5 other workers, > >> 2019-12-16 20:25:00.816 CET [29408] LOG: background worker "parallel worker" (PID 14079) exited with exit code 1 >> 2019-12-16 20:25:00.816 CET [29408] LOG: background worker "parallel worker" (PID 14085) exited with exit code 1 >> 2019-12-16 20:25:00.817 CET [29408] LOG: background worker "parallel worker" (PID 14084) exited with exit code 1 >> 2019-12-16 20:25:00.817 CET [29408] LOG: background worker "parallel worker" (PID 14080) exited with exit code 1 >> 2019-12-16 20:25:00.818 CET [29408] LOG: background worker "parallel worker" (PID 14078) exited with exit code 1 > >... 
but somehow there were 2 more workers that didn't get the message >and are waiting: > >> #0 0x00007f64684d2bfb in epoll_wait (epfd=6, events=0x55d8873d53d8, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 >> #1 0x000055d8864d4e3c in WaitEventSetWaitBlock (set=0x55d8873d5378, cur_timeout=-1, occurred_events=0x7ffc6e805130, nevents=1)at latch.c:1080 >> #2 0x000055d8864d4d00 in WaitEventSetWait (set=0x55d8873d5378, timeout=-1, occurred_events=0x7ffc6e805130, nevents=1,wait_event_info=134217742) at latch.c:1032 >> #3 0x000055d8864e1848 in ConditionVariableTimedSleep (cv=0x7f63602a060c, timeout=-1, wait_event_info=134217742) at condition_variable.c:186 >> #4 0x000055d8864e17b2 in ConditionVariableSleep (cv=0x7f63602a060c, wait_event_info=134217742) at condition_variable.c:126 >> #5 0x000055d8864d05f2 in BarrierArriveAndWait (barrier=0x7f63602a05f4, wait_event_info=134217742) at barrier.c:191 >> #6 0x000055d886328d4e in ExecParallelHashIncreaseNumBatches (hashtable=0x55d887481e70) at nodeHash.c:1189 > >They're waiting on a condition variable, in a barrier, while trying to >coordinate an increase in the number of batches, while hashing. I >suspect the details of what exactly it's waiting for are not relevant >here though. ConditionVariableTimedSleep() contains a standard latch >wait loop with CHECK_FOR_INTERRUPTS(), and parallel workers run with >the standard SIGTERM handler die(). So what I expect to happen here >is: > >1. DestroyParallelContext() calls TerminateBackgroundWorker() for all workers. >2. TerminateBackgroundWorker() sets slot->terminate, and signals the >postmaster. >3. The postmaster sees the terminate flag and does kill(worker, SIGTERM). >4. The worker handles SIGTERM by setting ProcDiePending and setting >its own latch. >5. The ConditionVariableTimedSleep()'s syscall returs, >CHECK_FOR_INTERRUPTS() runs and sees ProcDiePending. > >So how did it escape this fate? Ooooh. Now I'm wondering if the >problem is that ConditionVariablePrepareToSleep() contains a call to >ResetLatch(), but doesn't then run CHECK_FOR_INTERRUPTS(). Perhaps >that ate the latch that was set by die(). There isn't another >CHECK_FOR_INTERRUPTS() until *after* the WaitEventSetWait() in >ConditionVariableTimedSleep(). > >> I'll leave the system in this "stuck" space in case we need to do >> further investigation. > >What are the value sof InterruptHoldoffCount, CritSectionCount, >InterruptPending and ProcDiePending in the stuck workers? In both processes it's like this: (gdb) p InterruptHoldoffCount $1 = 0 (gdb) p CritSectionCount $2 = 0 (gdb) p InterruptPending $3 = 1 (gdb) p ProcDiePending $5 = 1 So yeah, that probably means the theory about CHECK_FOR_INTERRUPTS call missing right after latch reset is correct. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Dec 18, 2019 at 12:21 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > So yeah, that probably means the theory about CHECK_FOR_INTERRUPTS call > missing right after latch reset is correct. I wonder if it'd be better to remove that ResetLatch() (what is it good for?) or add CFI().
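To make the suspected lost wakeup concrete, here is a tiny, self-contained toy model of the sequence described above. It is deliberately not the real latch or condition-variable code: latch_is_set and die_pending merely stand in for the worker's latch and ProcDiePending, and the "fix" branch models adding a CHECK_FOR_INTERRUPTS() right after the reset.

#include <stdbool.h>
#include <stdio.h>

/* Toy stand-ins for MyLatch and ProcDiePending; a model of the race being
 * discussed, not the actual PostgreSQL latch machinery. */
static bool latch_is_set;
static bool die_pending;

/* What die() does, conceptually: flag the pending exit and set the latch,
 * which is the sleeping worker's only wakeup. */
static void
toy_sigterm(void)
{
    die_pending = true;
    latch_is_set = true;
}

/* The shape of ConditionVariablePrepareToSleep() in question: it resets the
 * latch before the caller's predicate loop starts waiting. */
static void
toy_prepare_to_sleep(bool check_interrupts_after_reset)
{
    latch_is_set = false;                   /* ResetLatch() */

    if (check_interrupts_after_reset && die_pending)
        printf("interrupt seen right after reset: worker aborts promptly\n");
}

int
main(void)
{
    /* SIGTERM lands just before the worker prepares to sleep ... */
    toy_sigterm();
    toy_prepare_to_sleep(false);
    if (!latch_is_set && die_pending)
        printf("stuck: die is pending but the only wakeup was just reset\n");

    /* ... versus the proposed extra check after the reset. */
    toy_sigterm();
    toy_prepare_to_sleep(true);
    return 0;
}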
On Sat, Dec 14, 2019 at 1:22 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > >On 2019-Dec-10, Tomas Vondra wrote: > > >> It's probably still a good idea to do this, although I wonder if we > > >> might start doing this only when actually running out of bits (and use > > >> the current algorithm by default). But I have no idea if that would be > > >> any cheaper, and it would add complexity.

> - *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
> + *batchno = rotate(hashvalue, hashtable->log2_nbuckets) & (nbatch - 1);

I ran the 150MB 4096 batch "sevenb" self-join with the "rotate" patch, and it worked as expected. I'm now planning to commit that version, unless there are objections or someone wants to argue for a different way to spell rotate() etc. If I disassemble and diff unpatched and patched ExecHashGetBucketAndBatch() and functions that inline it, as compiled by Clang 6, I see just eg:

- 0x00000000006ba54e <+30>: shr %cl,%esi
+ 0x00000000006ba54e <+30>: ror %cl,%esi

Here's a festive pictorial representation of the various dead horses flogged in this thread, for the holidays (view in monospace font):

The "shift" algorithm in (pre-existing coding):

  <----------batchno---|  |---bucketno---|
  [MSB . . . . . . . . . . . . LSB]

  * can't increase nbuckets (or max bucket load would degrade too soon)
  * reads zeroes off the end, so further partitioning is broken

The "bit reverse" technique proposed earlier:

  |---batchno----------->  <---bucketno---|
  [MSB . . . . . . . . . . . . LSB]

  * could increase nbuckets: bucket load degradation is maximally deferred
  * takes a few more instructions to compute batchno

The "rotate" technique:

  ------batchno---|  <---  |---bucketno---|
  [MSB . . . . . . . . . . . . LSB]

  * same as "shift" until you run out of bits, then bucket load degrades
  * can't increase nbuckets
  * only changes one machine code instruction

The "swap ends" technique:

  <----------batchno---|  |---bucketno---|
  [MSB . . . . . . . . . . . . LSB]

  * really just a rotated version of "rotate" technique
  * meh
Thomas Munro <thomas.munro@gmail.com> writes: > I ran the 150MB 4096 batch "sevenb" self-join with the "rotate" patch, > and it worked as expected. I'm now planning to commit that version, > unless there are objections or someone wants to argue for a different > way to spell rotate() etc. FWIW, I do want to argue for a different way to spell that. It seems like a mighty generic function name --- in particular, there's no hint of the word width we're rotating in. Maybe "rotate32" would be good enough, though I'm tempted to propose "pg_rotate32" or "rotate_bit32" or some other variation on that. regards, tom lane
On Sat, Dec 21, 2019 at 6:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@gmail.com> writes: > > I ran the 150MB 4096 batch "sevenb" self-join with the "rotate" patch, > > and it worked as expected. I'm now planning to commit that version, > > unless there are objections or someone wants to argue for a different > > way to spell rotate() etc. > > FWIW, I do want to argue for a different way to spell that. It seems > like a mighty generic function name --- in particular, there's no > hint of the word width we're rotating in. Maybe "rotate32" would be > good enough, though I'm tempted to propose "pg_rotate32" or "rotate_bit32" > or some other variation on that. Yeah, I suppose it would be nice to get something reusable out of this rather than a local definition only. The other important characteristic is the direction of rotation, so here's a version that defines pg_rotate_right32() in pg_bitutils.h. In back-branches without that header we could define it locally.
Thomas Munro <thomas.munro@gmail.com> writes: > Yeah, I suppose it would be nice to get something reusable out of this > rather than a local definition only. The other important > characteristic is the direction of rotation, so here's a version that > defines pg_rotate_right32() in pg_bitutils.h. In back-branches > without that header we could define it locally. That looks like a good approach. One thought is that you could rely on BITS_PER_BYTE from c.h, rather than CHAR_BIT; then you don't have to #include <limits.h> in bitutils.h, which seems like header creep we could do without. (I think most other places that want this constant use BITS_PER_BYTE, too.) regards, tom lane
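For reference, a sketch of what the helper could end up looking like with that suggestion applied; this is an approximation of the shape being discussed for pg_bitutils.h, not a quote of the committed code:

/* Sketch of a pg_bitutils.h helper: rotate the bits of "word" right by n.
 * uint32 and BITS_PER_BYTE come from c.h, so no include of <limits.h> is
 * needed. */
static inline uint32
pg_rotate_right32(uint32 word, int n)
{
    return (word >> n) | (word << (sizeof(word) * BITS_PER_BYTE - n));
}

The hash join call site would then presumably read *batchno = pg_rotate_right32(hashvalue, hashtable->log2_nbuckets) & (nbatch - 1), and, as noted upthread, back branches without pg_bitutils.h could carry the same static inline definition locally.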
On Sun, Dec 22, 2019 at 4:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@gmail.com> writes: > > Yeah, I suppose it would be nice to get something reusable out of this > > rather than a local definition only. The other important > > characteristic is the direction of rotation, so here's a version that > > defines pg_rotate_right32() in pg_bitutils.h. In back-branches > > without that header we could define it locally. > > That looks like a good approach. One thought is that you could rely > on BITS_PER_BYTE from c.h, rather than CHAR_BIT; then you don't have > to #include <limits.h> in bitutils.h, which seems like header creep > we could do without. (I think most other places that want this > constant use BITS_PER_BYTE, too.) Done, and pushed. I'm hoping to look into wider hashing support which will give better results on big joins soon, but not in time for 13.