Thread: Index scan is not pushed down to union all subquery
Hi, Is it intended that indexes are not pushed down to union all subqueries if even a single select contains a where clause? Is this just not implemented, is it impossible to implement or am I doing something wrong? The following query does a SeqScan for "bikes" and "cars" tables even though IndexScan on their column "dealer_name" would be a magnitude faster. (Schema with sample data at the bottom of this mail) WITH targets as ( select 'bike' vehicle, id, dealer_name FROM bikes WHERE frame_size = 52 union all select 'car' vehicle, id, dealer_name FROM cars -- In the real use case I have here dozens of tables ) SELECT dealers.name dealer, targets.vehicle, targets.id FROM dealers JOIN targets ON dealers.name = targets.dealer_name WHERE dealers.id in (54,12,456,315,468) If the WHERE clause from the "bikes" subquery is removed then first Index Scan on dealers_pk is made and then Bitmap Index Scans on "bikes" and "cars" table using indexes on "dealer_name" columns. --------------------------------------------------------------- -- Available also at: https://www.db-fiddle.com/f/cEXt8HXSaQzsQ2yBDA2Z4H/7 CREATE TABLE dealers AS SELECT id, (SELECT string_agg(CHR(65+(random() * 25)::integer), '') FROM generate_series(1, 4) WHERE id>0) name FROM generate_series(1, 1000) AS id ; ALTER TABLE dealers ADD primary key (id); CREATE INDEX ON dealers(name); CREATE TABLE bikes AS SELECT generate_series AS id, (SELECT name FROM dealers WHERE dealers.id = (SELECT (random()*1000)::int WHERE generate_series>0)) AS dealer_name, (random()*12+50)::int as frame_size FROM generate_series(1, 100000); ALTER TABLE bikes ADD primary key (id); CREATE INDEX ON bikes(dealer_name); CREATE TABLE cars AS SELECT generate_series as id, (SELECT name FROM dealers WHERE dealers.id = (SELECT (random()*1000)::int WHERE generate_series>0)) AS dealer_name, (random()*7+14)::int as wheel_size FROM generate_series(1, 100000); ALTER TABLE cars ADD primary key (id); CREATE INDEX ON cars(dealer_name); ANALYZE; -- - Lauri
Hi, Is it intended that indexes are not pushed down to union all subqueries if even a single select contains a where clause? Is this just not implemented, is it impossible to implement or am I doing something wrong? The following query does a SeqScan for "bikes" and "cars" tables even though IndexScan on their column "dealer_name" would be a magnitude faster. (Schema with sample data at the bottom of this mail) WITH targets as ( select 'bike' vehicle, id, dealer_name FROM bikes WHERE frame_size = 52 union all select 'car' vehicle, id, dealer_name FROM cars -- In the real use case I have here dozens of tables ) SELECT dealers.name dealer, targets.vehicle, targets.id FROM dealers JOIN targets ON dealers.name = targets.dealer_name WHERE dealers.id in (54,12,456,315,468) If the WHERE clause from the "bikes" subquery is removed then first Index Scan on dealers_pk is made and then Bitmap Index Scans on "bikes" and "cars" table using indexes on "dealer_name" columns. --------------------------------------------------------------- -- Available also at: https://www.db-fiddle.com/f/cEXt8HXSaQzsQ2yBDA2Z4H/7 CREATE TABLE dealers AS SELECT id, (SELECT string_agg(CHR(65+(random() * 25)::integer), '') FROM generate_series(1, 4) WHERE id>0) name FROM generate_series(1, 1000) AS id ; ALTER TABLE dealers ADD primary key (id); CREATE INDEX ON dealers(name); CREATE TABLE bikes AS SELECT generate_series AS id, (SELECT name FROM dealers WHERE dealers.id = (SELECT (random()*1000)::int WHERE generate_series>0)) AS dealer_name, (random()*12+50)::int as frame_size FROM generate_series(1, 100000); ALTER TABLE bikes ADD primary key (id); CREATE INDEX ON bikes(dealer_name); CREATE TABLE cars AS SELECT generate_series as id, (SELECT name FROM dealers WHERE dealers.id = (SELECT (random()*1000)::int WHERE generate_series>0)) AS dealer_name, (random()*7+14)::int as wheel_size FROM generate_series(1, 100000); ALTER TABLE cars ADD primary key (id); CREATE INDEX ON cars(dealer_name); ANALYZE; -- - Lauri
With an INNER JOIN, both tables must be fully checked/matched (check using EXPLAIN ANALYSIS -> HashJoin), so the index cannot be used here.
--
regards, marian wendt
regards, marian wendt
Hi, Is it intended that indexes are not pushed down to union all subqueries if even a single select contains a where clause? Is this just not implemented, is it impossible to implement or am I doing something wrong? The following query does a SeqScan for "bikes" and "cars" tables even though IndexScan on their column "dealer_name" would be a magnitude faster. (Schema with sample data at the bottom of this mail) WITH targets as ( select 'bike' vehicle, id, dealer_name FROM bikes WHERE frame_size = 52 union all select 'car' vehicle, id, dealer_name FROM cars -- In the real use case I have here dozens of tables ) SELECT dealers.name dealer, targets.vehicle, targets.id FROM dealers JOIN targets ON dealers.name = targets.dealer_name WHERE dealers.id in (54,12,456,315,468) If the WHERE clause from the "bikes" subquery is removed then first Index Scan on dealers_pk is made and then Bitmap Index Scans on "bikes" and "cars" table using indexes on "dealer_name" columns. --------------------------------------------------------------- -- Available also at: https://www.db-fiddle.com/f/cEXt8HXSaQzsQ2yBDA2Z4H/7 CREATE TABLE dealers AS SELECT id, (SELECT string_agg(CHR(65+(random() * 25)::integer), '') FROM generate_series(1, 4) WHERE id>0) name FROM generate_series(1, 1000) AS id ; ALTER TABLE dealers ADD primary key (id); CREATE INDEX ON dealers(name); CREATE TABLE bikes AS SELECT generate_series AS id, (SELECT name FROM dealers WHERE dealers.id = (SELECT (random()*1000)::int WHERE generate_series>0)) AS dealer_name, (random()*12+50)::int as frame_size FROM generate_series(1, 100000); ALTER TABLE bikes ADD primary key (id); CREATE INDEX ON bikes(dealer_name); CREATE TABLE cars AS SELECT generate_series as id, (SELECT name FROM dealers WHERE dealers.id = (SELECT (random()*1000)::int WHERE generate_series>0)) AS dealer_name, (random()*7+14)::int as wheel_size FROM generate_series(1, 100000); ALTER TABLE cars ADD primary key (id); CREATE INDEX ON cars(dealer_name); ANALYZE; -- - LauriWith an INNER JOIN, both tables must be fully checked/matched (check using EXPLAIN ANALYSIS -> HashJoin), so the index cannot be used here.Sorry, didn't consider the WITH part. Please share the detailed query plan for more info.
--
regards, marian wendt
regards, marian wendt
Hi, Is it intended that indexes are not pushed down to union all subqueries if even a single select contains a where clause? Is this just not implemented, is it impossible to implement or am I doing something wrong? The following query does a SeqScan for "bikes" and "cars" tables even though IndexScan on their column "dealer_name" would be a magnitude faster. (Schema with sample data at the bottom of this mail) WITH targets as ( select 'bike' vehicle, id, dealer_name FROM bikes WHERE frame_size = 52 union all select 'car' vehicle, id, dealer_name FROM cars -- In the real use case I have here dozens of tables ) SELECT dealers.name dealer, targets.vehicle, targets.id FROM dealers JOIN targets ON dealers.name = targets.dealer_name WHERE dealers.id in (54,12,456,315,468) If the WHERE clause from the "bikes" subquery is removed then first Index Scan on dealers_pk is made and then Bitmap Index Scans on "bikes" and "cars" table using indexes on "dealer_name" columns. --------------------------------------------------------------- -- Available also at: https://www.db-fiddle.com/f/cEXt8HXSaQzsQ2yBDA2Z4H/7 CREATE TABLE dealers AS SELECT id, (SELECT string_agg(CHR(65+(random() * 25)::integer), '') FROM generate_series(1, 4) WHERE id>0) name FROM generate_series(1, 1000) AS id ; ALTER TABLE dealers ADD primary key (id); CREATE INDEX ON dealers(name); CREATE TABLE bikes AS SELECT generate_series AS id, (SELECT name FROM dealers WHERE dealers.id = (SELECT (random()*1000)::int WHERE generate_series>0)) AS dealer_name, (random()*12+50)::int as frame_size FROM generate_series(1, 100000); ALTER TABLE bikes ADD primary key (id); CREATE INDEX ON bikes(dealer_name); CREATE TABLE cars AS SELECT generate_series as id, (SELECT name FROM dealers WHERE dealers.id = (SELECT (random()*1000)::int WHERE generate_series>0)) AS dealer_name, (random()*7+14)::int as wheel_size FROM generate_series(1, 100000); ALTER TABLE cars ADD primary key (id); CREATE INDEX ON cars(dealer_name); ANALYZE; -- - LauriWith an INNER JOIN, both tables must be fully checked/matched (check using EXPLAIN ANALYSIS -> HashJoin), so the index cannot be used here.Sorry, didn't consider the WITH part. Please share the detailed query plan for more info.
The "bikes" subquery uses field "frame_size" in WHERE clause but the field does not have an index...ADD: Consider whether it might make sense to take a more generalist approach by only having one entity vehicle with the distinction "car", "bike", etc...?
ADD: Consider to do more complex "detailed" SELECTs that are unioned (if that is really needed)?
--
regards, marian wendt
regards, marian wendt
On Thu, Oct 5, 2023 at 11:35 AM Marian Wendt <marian.wendt@yahoo.com> wrote:
With an INNER JOIN, both tables must be fully checked/matched (check using EXPLAIN ANALYSIS -> HashJoin), so the index cannot be used here.Sorry, didn't consider the WITH part. Please share the detailed query plan for more info.The "bikes" subquery uses field "frame_size" in WHERE clause but the field does not have an index...ADD: Consider whether it might make sense to take a more generalist approach by only having one entity vehicle with the distinction "car", "bike", etc...?
ADD: Consider to do more complex "detailed" SELECTs that are unioned (if that is really needed)?
Marian, Lauri's question is clearly about the planner, and not asking about writing the SQL differently, or changing the data model.
Her sample data puts a 1% chance of cars or bikes matching a dealer, so using the indexes that exist should be preferred over a full scan.
She also implies that w/o the WHERE clause in the CTE's union-all query, the outer JOIN-clause would be pushed down (seems to me),
resulting in likely using the indexes. Lauri, you haven't said which version of PostgreSQL. Did you assume the latest v16? My $0.02. --DD
On Thu, Oct 5, 2023 at 12:30 PM Marian Wendt <marian.wendt@yahoo.com> wrote:
>
> With an INNER JOIN, both tables must be fully checked/matched (check using EXPLAIN ANALYSIS -> HashJoin), so the index cannot be used here.
> Sorry, didn't consider the WITH part. Please share the detailed query plan for more info.
Now, it reads all the 100 000 bikes, filters out 91 778 rows then appends all the 100 000 cars. Then it uses a hash join to match these 108 222 rows to 5 dealers.
In my opinion the index could be used. What I would do if I were a database, I would first find names of the dealers (in this case 5 dealers out of 1000) using the dealer's primary key then find all the bikes that match with the dealer name using the index on dealer_name column (in sample data ~500 bikes out of 100 000). Then filter those bikes with the frame_size (filter out ~460 bikes). Append to this result set all the cars matching the dealer_name condition (~500 cars out of 100 000) (again find using the index on the dealer_name column).
Here's the current query plan:
Hash Join (cost=21.53..4511.74 rows=542 width=41) (actual time=0.233..27.507 rows=479 loops=1)
Hash Cond: (bikes.dealer_name = dealers.name)
-> Append (cost=0.00..3943.40 rows=108280 width=41) (actual time=0.009..20.351 rows=108222 loops=1)
-> Seq Scan on bikes (cost=0.00..1826.00 rows=8280 width=41) (actual time=0.009..6.074 rows=8222 loops=1)
Filter: (frame_size = 52)
Rows Removed by Filter: 91778
-> Seq Scan on cars (cost=0.00..1576.00 rows=100000 width=41) (actual time=0.011..9.175 rows=100000 loops=1)
-> Hash (cost=21.46..21.46 rows=5 width=5) (actual time=0.041..0.043 rows=5 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Index Scan using dealers_pkey on dealers (cost=0.28..21.46 rows=5 width=5) (actual time=0.009..0.038 rows=5 loops=1)
Index Cond: (id = ANY ('{54,12,456,315,468}'::integer[]))
Planning Time: 0.152 ms
Execution Time: 27.558 ms
In my dreams the plan would be something like this:
Nested Loop
-> Index Scan using dealers_pkey on dealers
Index Cond: (id = ANY ('{54,12,456,315,468}'::integer[]))
-> Append
-> Bitmap Heap Scan on bikes
Recheck Cond: (dealer_name = dealers.name)
Filter: (frame_size = 52)
Rows Removed by Filter: 91
-> Bitmap Index Scan on bikes_dealer_name_idx
Index Cond: (dealer_name = dealers.name)
-> Bitmap Heap Scan on cars
Recheck Cond: (dealer_name = dealers.name)
-> Bitmap Index Scan on cars_dealer_name_idx
Index Cond: (dealer_name = dealers.name)
If I don't add the WHERE condition for bikes I get exactly that but without the filter node. Without the frame_size filter the Execution Time is 1.028 ms
On Thu, Oct 5, 2023 at 12:38 PM Marian Wendt <marian.wendt@yahoo.com> wrote:
>
> The "bikes" subquery uses field "frame_size" in WHERE clause but the field does not have an index...
> ADD: Consider whether it might make sense to take a more generalist approach by only having one entity vehicle with the distinction "car", "bike", etc...?
> ADD: Consider to do more complex "detailed" SELECTs that are unioned (if that is really needed)?
I don't want to index frame size since I don't use it to look up rows by itself. It should only be used to filter only interested bikes.
This is just a sample case and in my real use case I want to validate my "source" objects against 80 "target" tables. I could have separate queries for each target table but that adds overhead in network latencies etc. if having over 80 separate queries.
And if doing the join separately for each target table and then joining the results together that is also slow since it accesses the delalers_pk index that 80 times.
On Thu, Oct 5, 2023 at 12:47 PM Dominique Devienne <ddevienne@gmail.com> wrote:
>
>
> With an INNER JOIN, both tables must be fully checked/matched (check using EXPLAIN ANALYSIS -> HashJoin), so the index cannot be used here.
> Sorry, didn't consider the WITH part. Please share the detailed query plan for more info.
Now, it reads all the 100 000 bikes, filters out 91 778 rows then appends all the 100 000 cars. Then it uses a hash join to match these 108 222 rows to 5 dealers.
In my opinion the index could be used. What I would do if I were a database, I would first find names of the dealers (in this case 5 dealers out of 1000) using the dealer's primary key then find all the bikes that match with the dealer name using the index on dealer_name column (in sample data ~500 bikes out of 100 000). Then filter those bikes with the frame_size (filter out ~460 bikes). Append to this result set all the cars matching the dealer_name condition (~500 cars out of 100 000) (again find using the index on the dealer_name column).
Here's the current query plan:
Hash Join (cost=21.53..4511.74 rows=542 width=41) (actual time=0.233..27.507 rows=479 loops=1)
Hash Cond: (bikes.dealer_name = dealers.name)
-> Append (cost=0.00..3943.40 rows=108280 width=41) (actual time=0.009..20.351 rows=108222 loops=1)
-> Seq Scan on bikes (cost=0.00..1826.00 rows=8280 width=41) (actual time=0.009..6.074 rows=8222 loops=1)
Filter: (frame_size = 52)
Rows Removed by Filter: 91778
-> Seq Scan on cars (cost=0.00..1576.00 rows=100000 width=41) (actual time=0.011..9.175 rows=100000 loops=1)
-> Hash (cost=21.46..21.46 rows=5 width=5) (actual time=0.041..0.043 rows=5 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Index Scan using dealers_pkey on dealers (cost=0.28..21.46 rows=5 width=5) (actual time=0.009..0.038 rows=5 loops=1)
Index Cond: (id = ANY ('{54,12,456,315,468}'::integer[]))
Planning Time: 0.152 ms
Execution Time: 27.558 ms
In my dreams the plan would be something like this:
Nested Loop
-> Index Scan using dealers_pkey on dealers
Index Cond: (id = ANY ('{54,12,456,315,468}'::integer[]))
-> Append
-> Bitmap Heap Scan on bikes
Recheck Cond: (dealer_name = dealers.name)
Filter: (frame_size = 52)
Rows Removed by Filter: 91
-> Bitmap Index Scan on bikes_dealer_name_idx
Index Cond: (dealer_name = dealers.name)
-> Bitmap Heap Scan on cars
Recheck Cond: (dealer_name = dealers.name)
-> Bitmap Index Scan on cars_dealer_name_idx
Index Cond: (dealer_name = dealers.name)
If I don't add the WHERE condition for bikes I get exactly that but without the filter node. Without the frame_size filter the Execution Time is 1.028 ms
On Thu, Oct 5, 2023 at 12:38 PM Marian Wendt <marian.wendt@yahoo.com> wrote:
>
> The "bikes" subquery uses field "frame_size" in WHERE clause but the field does not have an index...
> ADD: Consider whether it might make sense to take a more generalist approach by only having one entity vehicle with the distinction "car", "bike", etc...?
> ADD: Consider to do more complex "detailed" SELECTs that are unioned (if that is really needed)?
I don't want to index frame size since I don't use it to look up rows by itself. It should only be used to filter only interested bikes.
This is just a sample case and in my real use case I want to validate my "source" objects against 80 "target" tables. I could have separate queries for each target table but that adds overhead in network latencies etc. if having over 80 separate queries.
And if doing the join separately for each target table and then joining the results together that is also slow since it accesses the delalers_pk index that 80 times.
On Thu, Oct 5, 2023 at 12:47 PM Dominique Devienne <ddevienne@gmail.com> wrote:
>
> ...
> Lauri, you haven't said which version of PostgreSQL. Did you assume the latest v16?
Thanks Dominique for clarifying the question.
I have tested this on Postgres 12 and also with the latest 16.0
> Lauri, you haven't said which version of PostgreSQL. Did you assume the latest v16?
Thanks Dominique for clarifying the question.
I have tested this on Postgres 12 and also with the latest 16.0
On Thu, Oct 5, 2023 at 4:25 PM Lauri Kajan <lauri.kajan@gmail.com> wrote:
In my dreams the plan would be something like this:
Nested Loop
-> Index Scan using dealers_pkey on dealers
Index Cond: (id = ANY ('{54,12,456,315,468}'::integer[]))
-> Append
-> Bitmap Heap Scan on bikes
Recheck Cond: (dealer_name = dealers.name)
Filter: (frame_size = 52)
Rows Removed by Filter: 91
-> Bitmap Index Scan on bikes_dealer_name_idx
Index Cond: (dealer_name = dealers.name)
-> Bitmap Heap Scan on cars
Recheck Cond: (dealer_name = dealers.name)
-> Bitmap Index Scan on cars_dealer_name_idx
Index Cond: (dealer_name = dealers.name)
OK, I'm getting pretty close.
With the following query where I select the filtered column in the union all and add the where clause to the top level query I get exactly the query plan I want.
EXPLAIN (ANALYZE)
WITH
targets as (
select 'bike' vehicle, id, dealer_name, frame_size as filter FROM bikes
union all
select 'car' vehicle, id, dealer_name, null as filter FROM cars
-- In the real use case I have here dozens of tables
)
SELECT
dealers.name dealer,
targets.vehicle,
targets.id
FROM
dealers
JOIN targets
ON dealers.name = targets.dealer_name
WHERE
dealers.id in (54,12,456,887,468)
and (filter is null or filter = 52)
EXPLAIN (ANALYZE)
WITH
targets as (
select 'bike' vehicle, id, dealer_name, frame_size as filter FROM bikes
union all
select 'car' vehicle, id, dealer_name, null as filter FROM cars
-- In the real use case I have here dozens of tables
)
SELECT
dealers.name dealer,
targets.vehicle,
targets.id
FROM
dealers
JOIN targets
ON dealers.name = targets.dealer_name
WHERE
dealers.id in (54,12,456,887,468)
and (filter is null or filter = 52)
But! This is not quite usable since it is tedious to write the query when there are filters in multiple tables and all different columns must be added to all the subqueries.
Regardless of that this kind of proves that the desired plan is possible to run with Postgres. So maybe this is just a missing feature in the Optimizer.
-Lauri