Re: Use BumpContext contexts for TupleHashTables' tablecxt - Mailing list pgsql-hackers
| From | David Rowley | 
|---|---|
| Subject | Re: Use BumpContext contexts for TupleHashTables' tablecxt | 
| Date | |
| Msg-id | CAApHDvpa=Ag++Fyhs8JURv4qUa8UeTkOGPJw1VPTTaz+qg7opg@mail.gmail.com | 
| In response to | Re: Use BumpContext contexts for TupleHashTables' tablecxt (Tom Lane <tgl@sss.pgh.pa.us>) | 
| Responses | Re: Use BumpContext contexts for TupleHashTables' tablecxt | 
| List | pgsql-hackers | 
On Thu, 30 Oct 2025 at 08:51, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > David Rowley <dgrowleyml@gmail.com> writes:
> >> I can't help wonder if we can improve the memory context parameter
> >> names in BuildTupleHashTable(). Every time I see "tablecxt" I have to
> >> remind myself that it's not for the bucket array, just the stuff we
> >> have the buckets point to. Would "hashedtuplecxt" be better?
>
> > I agree these names are not great. I think they might be leftovers
> > from a time when there actually was a complete hash-table structure
> > in that context.
>
> Looking closer, it seems like a lot of the confusion here arose when
> TupleHashTables were rebased onto simplehash.h. That patch seems to
> have paid about zero attention to updating comments that it'd
> falsified or at least made misleading. Here's a v2 that tries to
> clean up some of the mess.

Thanks for doing that cleanup. A good improvement. A few notes from
the review for your consideration:

1) The comment in ExecEndSetOp() says: "/* free subsidiary stuff
including hashtable */". How about adjusting that to "hashtable
data"? Is it worth mentioning there that the destruction of the hash
table itself is left to the reset of the ExecutorState context?
(Apparently execGrouping.c does not expose any way to call
tuplehash_destroy() anyway.)

2) I'm not sure if it makes it neater or not, but did you consider
moving the creation of the setopstate->tuplesContext context into
build_hash_table()? The motivation there is that it keeps the
hash-related initialisations together.

3) Not for this patch, but I wanted to note the observation: I see we
always create the hash table during plan startup in nodeSetOp.c. I
suspect this could cause the same issue for SetOps as the one I fixed
for Memoize in 57f59396b, which fixed a slow EXPLAIN in [1]. That was
mostly down to a bug that overestimated the number of buckets, but
there are legitimate reasons to want lots of buckets too... a big
table and lots of work_mem.

create table t (a int);
insert into t values(1);
alter table t alter column a set (n_distinct = -1); -- all values distinct
analyze t;
-- hack to fool the planner into thinking it's a big table.
update pg_class set reltuples = 1e9 where oid = 't'::regclass;
set work_mem = '4GB';
\timing on
explain (summary on) select a from t except select a from t;

 HashSetOp Except  (cost=25000002.00..27500002.00 rows=1000000000 width=4)
   Disabled: true
   ->  Seq Scan on t  (cost=0.00..10000001.00 rows=1000000000 width=4)
   ->  Seq Scan on t t_1  (cost=0.00..10000001.00 rows=1000000000 width=4)
 Planning Time: 0.091 ms   <-- not a long time
(5 rows)

Time: 1658.866 ms (00:01.659)   <-- a long time.
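Not something for this patch either, but just to sketch the sort of
thing I mean (untested, and the details are from memory of
nodeSetOp.c, so take it as an illustration only): the hashed path
could build the table on first execution instead of in
ExecInitSetOp(), reusing the existing build_hash_table() helper, so a
plain EXPLAIN never pays for allocating the bucket array:

    /* in ExecSetOp()'s hashed path -- untested sketch only */
    if (setopstate->hashtable == NULL)
        build_hash_table(setopstate);   /* deferred from ExecInitSetOp() */

    if (!setopstate->table_filled)
        setop_fill_hash_table(setopstate);

    return setop_retrieve_hash_table(setopstate);

The cost is just a NULL test per call of the node, which seems cheap
next to allocating a huge bucket array we may never use.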
> I'm also proposing to move the MemoryContextReset of that context
> into ResetTupleHashTable instead of documenting that we expect the
> caller to do that. The old way just seems like a bizarre artifact.
>
> > Related to this, while I was chasing Jeff's complaint I realized that
> > the none-too-small simplehash table for this is getting made in the
> > query's ExecutorState. That's pretty awful from the standpoint of
> > being able to blame memory consumption on the hash node.
>
> I didn't do anything about this. I briefly considered having
> BuildTupleHashTable construct a per-hashtable context, but backed
> off after seeing that nodeAgg.c goes to some effort to use the same
> metacxt and tuplescxt for several hashtables. I'm not sure if that
> had measured performance benefits behind it, but it well might have.
>
> I now think that if we care, the right thing is to make the remaining
> callers construct extra memory contexts for their hash tables. That
> could be left for another day.

Reminds me of 590b045c3, but I think it's maybe not as bad, since the
bucket array for the hash table is more likely to be a large
allocation, which aset.c will put on a dedicated AllocBlock rather
than sharing an AllocBlock with other chunks. ResetTupleHashTable()
also just results in an SH_RESET(), which doesn't do any pfreeing
work (just a memset()), i.e. it shouldn't cause fragmentation due to
many rescans. Before 590b045c3, we used to pfree() all tuples at
tuplestore_end(), which happens in a Materialize node's rescan. That
resulted in bad fragmentation of the ExecutorState context. Nothing
like that is happening here, so I'm not that worried about it, though
I agree that it isn't ideal.
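On moving the MemoryContextReset into ResetTupleHashTable: for what
it's worth, my reading is that the reset path ends up looking roughly
like this (just a sketch, using the current field names -- or
whatever they end up being after the renaming discussed upthread):

    void
    ResetTupleHashTable(TupleHashTable hashtable)
    {
        /* SH_RESET() of the bucket array: just a memset, no pfree()s */
        tuplehash_reset(hashtable->hashtab);

        /*
         * Reset the context holding the hashed tuples here too, rather
         * than documenting that every caller must remember to do it.
         */
        MemoryContextReset(hashtable->tablecxt);
    }

which seems like a clear improvement over leaving that second step to
each caller.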
David

[1] https://postgr.es/m/CAFj8pRAMp%3DQsMi6sPQJ4W3hczoFJRvyXHJV3AZAZaMyTVM312Q%40mail.gmail.com