Patch to support SEMI and ANTI join removal - Mailing list pgsql-hackers

From David Rowley
Subject Patch to support SEMI and ANTI join removal
Date
Msg-id CAApHDvpCBEfuc5tD=vniepAv0pU5m=q=fOQZcOdMHeei7OQPgQ@mail.gmail.com
Whole thread Raw
Responses Re: Patch to support SEMI and ANTI join removal
Re: Patch to support SEMI and ANTI join removal
List pgsql-hackers
I've been working away at allowing semi and anti joins to be added to the list of join types that our join removal code supports.

The basic idea is that we can removal a semi or anti join if the left hand relation references the relation that's being semi/anti joined if the join condition matches a foreign key definition on the left hand relation.

To give an example:

Given the 2 tables:
create table t2 (t1_id int primary key);
create table t1 (value int references t2);

The join to t2 would not be required in:

select * from t1 where value in(select t1_id from t2);

Neither would it be here:

select * from t1 where not exists(select 1 from t2 where t1_id=value);

To give a bit of background, I initially proposed the idea here:


And some issues were raised around the fact that updates to the referenced relation would only flush out changes to the referencing tables on completion of the command, and if we happened to be planning a query that was located inside a volatile function then we wouldn't know that the parent query hadn't updated some of these referenced tables.


In the attached I've used Noah's solution to the problem, and it seems to work just fine. (See regression test in the attached patch)


Where he mentioned that it may be possible that the foreign key trigger queue gets added to after planning has taken place. 
I've spent some time looking into this and I've not yet managed to find a case where this matters as it seems that updates made in 1 command are not visible to that same command. I've tested various different test cases in all transaction isolation levels and also tested update commands which call volatile functions that perform updates in the same table that the outer update will reach later in the command.

The patch (attached) is also now able to detect when a NOT EXISTS clause cannot produce any records at all.

If I make a simple change to the tables I defined above:

ALTER TABLE t1 ALTER COLUMN value SET NOT NULL;

Then the following will be produced:

explain (costs off) select * from t1 where not exists(select 1 from t2 where t1_id=value);
        QUERY PLAN
--------------------------
 Result
   One-Time Filter: false
   ->  Seq Scan on t1

A small note on my intentions with this patch:

I'm not seeing the use case for all of this to be massive, I'm more interested in this patch to use it as a stepping stone towards implementing INNER JOIN removals which would use foreign keys in a similar way to attempt to prove that the join is not required. I decided to tackle semi and anti joins first as these are a far more simple case, and it also adds quite a bit of the infrastructure that would be required for inner join removal, plus if nobody manages to poke holes in my ideas with this then I should have good grounds to begin the work on the inner join removal code. I also think if we're bothering to load foreign key constraints at planning time, then only using them for inner join removals wouldn't be making full use of them, so likely this patch would be a good idea anyway.

Currently most of my changes are in analyzejoin.c, but I did also have to make changes to load the foreign key constraints so that they were available to the planner. One thing that is currently lacking, which would likely be needed, before the finished patch is ready, would be a "relhasfkeys" column in pg_class. Such a column would mean that it would be possible to skip scanning pg_constraint for foreign keys when there's none to find. I'll delay implementing that until I get a bit more feedback to weather this patch would be a welcome addition to the existing join removal code or not.

I'm submitting this (a little early) for the August commitfest, but if anyone has any time to glance at it before then then that would be a really good help.

Regards

David Rowley



Attachment

pgsql-hackers by date:

Previous
From: Marti Raudsepp
Date:
Subject: Re: PostrgeSQL vs oracle doing 1 million sqrts am I doing it wrong?
Next
From: Ashutosh Bapat
Date:
Subject: Re: Introducing coarse grain parallelism by postgres_fdw.