Should HashSetOp go away - Mailing list pgsql-hackers

From Jeff Janes
Subject Should HashSetOp go away
Date
Msg-id CAMkU=1zia0JfW_QR8L5xA2vpa0oqVuiapm78h=WpNsHH13_9uw@mail.gmail.com
Whole thread Raw
Responses Re: Should HashSetOp go away
List pgsql-hackers
Many years ago I ran into some problems doing maintenance tasks checking for short identifiers which existed in one table but not another.  It would choose HashSetOp, which was memory inefficient, and it was also unaware of how memory inefficient it was, leading it to blow well past work_mem, by many fold.  So if you set work_mem to a large value in order to get maintenance operations over with quickly while the system is basically single-user, it could cause crashes.  It certainly isn't the only part of PostgreSQL which is memory inefficient, but it did seem particularly egregious.

I noticed some changes in this code v18, so wanted to revisit the issue.  Under commit 27627929528e, it looks like it got 25% more memory efficient, but it thinks it got 40% more efficient, so the memory use got better but the estimation actually got worse.

Using the data:
create table jj as select lpad(x::text,15,'0') from generate_series(1,10000000) f(x);

And the dummy query:
select lpad from jj except select lpad from jj;

It goes from needing a work_mem of at least 270MB to choose HashSetOp where it actually uses 1.3GB (as determined by 'max resident size' from log_executor_stats, which is not perfect but should be pretty close--I intentionally used a small shared_buffers so that it didn't contribute much to the memory usage) in v17 to needing work_mem of at least 160MB while actually using 1.0GB in 18devel commit 276279.  Under 18.0, it is slightly but not meaningfully different from commit 276279.

I was thinking of ways to improve the memory usage (or at least its estimation) but decided maybe it would be better if HashSetOp went away entirely.  As far as I can tell HashSetOp has nothing to recommend it other than the fact that it already exists. If we instead used an elaboration on Hash Anti Join, then it would automatically get spilling to disk, parallel operations, better estimation, and the benefits of whatever micro optimizations people lavish on the highly used HashJoin machinery but not the obscure, little-used HashSetOp.

It would need to elaborate the HashAntiJoin so that it can deduplicate one input (in the case of EXCEPT) or count the other input (in the case of EXCEPT ALL).

Is there some reason this is not feasible?

Yes, I could (and did) rewrite my query to force it to use the AntiJoin, but why should people need to do that when the planner can do it instead?

Cheers,

Jeff

pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: PG18 GIN parallel index build crash - invalid memory alloc request size
Next
From: Mahmoud Ayman
Date:
Subject: Cannot log in to CommitFest due to cool-off period