Thread: Query planner not using indexes with JOIN query and OR clause

Query planner not using indexes with JOIN query and OR clause

From
Ben Hoyt
Date:
Hi folks,

I have a fairly simple three-table query (pasted below) with two LEFT JOINs and an OR in the WHERE clause that for some reason is doing sequential scans on all three tables (two of them large -- several million rows), even though I have indexes on the relevant "filename" columns.

Note the two parts of the where clause -- a filter on the image2 table and a filter on the panoramas table. If I comment out either filter and just filter on i.filename by itself, or p.filename by itself, the query planner uses the relevant index and the query takes a few milliseconds. But when I have both clauses (as shown below) it falls back to sequential scanning all three tables for some reason, taking several seconds.

What am I missing? There must be some reason PostgreSQL can't use the index in this case, but I can't see what it is. If I were PostgreSQL I'd be using the index on i.filename and p.filename to filter to a couple of rows first, then join, making it super-quick.

In this test I'm running PostgreSQL 9.3.3 on Windows 64-bit, but the same thing happens on our Linux-based database. We're using autovacuum on both databases, and I've tried manually VACUUM ANALYZE-ing all three tables just in case, but that doesn't help. Memory config values are set to sensible values. Note: this is a new query, so in terms of "history", it's always been slow.

My query and PostgreSQL version and the explain and a lot of other table data is pasted below.

QUERY
----------
select ai.position, i.filename as image_filename, p.filename as panorama_filename
from album_items ai
left join image2 i on i.imageid = ai.image_id
left join panoramas p on p.id = ai.panorama_id
where i.filename in ('pano360--v471.jpg', 'pano360-2--v474.jpg') or
      p.filename in ('pano360--v471', 'pano360-2--v474')
----------

POSTGRESQL VERSION
----------
PostgreSQL 9.3.3, compiled by Visual C++ build 1600, 64-bit
----------

EXPLAIN (ANALYZE, BUFFERS)
----------

Hash Left Join  (cost=344184.62..963863.99 rows=376 width=57) (actual time=3157.104..8838.329 rows=2 loops=1)
  Hash Cond: (ai.panorama_id = p.id)
  Filter: ((i.filename = ANY ('{pano360--v471.jpg,pano360-2--v474.jpg}'::text[])) OR (p.filename = ANY ('{pano360--v471,pano360-2--v474}'::text[])))
  Rows Removed by Filter: 7347790
  Buffers: shared hit=8967 read=198827, temp read=76936 written=75908
  I/O Timings: read=609.403
  ->  Hash Left Join  (cost=341001.56..781324.85 rows=7346959 width=39) (actual time=2660.821..7842.202 rows=7347792 loops=1)
        Hash Cond: (ai.image_id = i.imageid)
        Buffers: shared hit=6936 read=198827, temp read=76662 written=75640
        I/O Timings: read=609.403
        ->  Seq Scan on album_items ai  (cost=0.00..156576.59 rows=7346959 width=12) (actual time=0.009..981.074 rows=7347792 loops=1)
              Buffers: shared hit=4297 read=78810
              I/O Timings: read=251.402
        ->  Hash  (cost=194687.36..194687.36 rows=7203136 width=35) (actual time=2658.643..2658.643 rows=7200287 loops=1)
              Buckets: 2048  Batches: 512  Memory Usage: 976kB
              Buffers: shared hit=2639 read=120017, temp written=49961
              I/O Timings: read=358.000
              ->  Seq Scan on image2 i  (cost=0.00..194687.36 rows=7203136 width=35) (actual time=0.007..1063.586 rows=7200287 loops=1)
                    Buffers: shared hit=2639 read=120017
                    I/O Timings: read=358.000
  ->  Hash  (cost=2423.47..2423.47 rows=39247 width=26) (actual time=12.100..12.100 rows=39247 loops=1)
        Buckets: 2048  Batches: 4  Memory Usage: 575kB
        Buffers: shared hit=2031, temp written=170
        ->  Seq Scan on panoramas p  (cost=0.00..2423.47 rows=39247 width=26) (actual time=0.003..5.470 rows=39247 loops=1)
              Buffers: shared hit=2031
Total runtime: 8838.701 ms
----------

TABLE METADATA
----------
Number of rows in album_items: 7347792
Number of rows in image2: 7200287
Number of rows in panoramas: 39247
----------

TABLES (AND THEIR INDEXES) REFERENCED IN QUERY
----------
CREATE TABLE content.album_items
(
  album_id integer NOT NULL,
  image_id integer,
  "position" integer,
  caption text,
  active boolean NOT NULL,
  panorama_id integer,
  CONSTRAINT album_items_album_id_fkey FOREIGN KEY (album_id) REFERENCES content.albums (id),
  CONSTRAINT album_items_image_id_fkey FOREIGN KEY (image_id) REFERENCES content.image2 (imageid),
  CONSTRAINT album_items_panorama_id_fkey FOREIGN KEY (panorama_id) REFERENCES content.panoramas (id),
  CONSTRAINT album_image_unique UNIQUE (album_id, image_id)
);

CREATE INDEX album_items_album_id_idx ON content.album_items (album_id);
CREATE INDEX album_items_image_id_idx ON content.album_items (image_id);
CREATE INDEX album_items_panorama_id_idx ON content.album_items (panorama_id);


CREATE TABLE content.image2
(
  imageid integer NOT NULL DEFAULT nextval('image2_imageid_seq'::regclass),
  hotelid integer,
  filename text NOT NULL,
  originalfoldername text NOT NULL,
  width integer,
  height integer,
  active boolean NOT NULL DEFAULT false,
  importid integer,
  timetaken timestamp without time zone,
  state integer NOT NULL DEFAULT 1,
  has_wide boolean NOT NULL DEFAULT false,
  type integer,
  document_id integer,
  property_id integer,
  CONSTRAINT image2_pkey PRIMARY KEY (imageid),
  CONSTRAINT fk_image2_hotelid FOREIGN KEY (hotelid) REFERENCES content.hotel (hotelid),
  CONSTRAINT fk_image2_importid FOREIGN KEY (importid) REFERENCES content.imageimport (importid),
  CONSTRAINT image2_document_id_fkey FOREIGN KEY (document_id) REFERENCES content.documents (id),
  CONSTRAINT image2_property_id_fkey FOREIGN KEY (property_id) REFERENCES content.properties (id),
  CONSTRAINT uq_image2_filename UNIQUE (filename)
);

CREATE INDEX fki_image2_property_id_fkey ON content.image2 (property_id);
CREATE INDEX image2_document_id_idx ON content.image2 (document_id);
CREATE INDEX image2_importid_idx ON content.image2 (importid);
CREATE INDEX ix_image2_hotelid ON content.image2 (hotelid);
CREATE INDEX ix_image2_imageid ON content.image2 (imageid);


CREATE TABLE content.panoramas
(
  id integer NOT NULL DEFAULT nextval('panoramas_id_seq'::regclass),
  hotel_id integer,
  filename text NOT NULL,
  folder text NOT NULL,
  import_id integer NOT NULL,
  active boolean NOT NULL DEFAULT false,
  state integer NOT NULL,
  num_images integer NOT NULL DEFAULT 0,
  type integer,
  hdr boolean NOT NULL DEFAULT false,
  has_preview boolean NOT NULL DEFAULT false,
  property_id integer,
  data json,
  previews_created boolean NOT NULL DEFAULT false,
  CONSTRAINT panoramas_pkey PRIMARY KEY (id),
  CONSTRAINT fk_panoramas_hotel_id FOREIGN KEY (hotel_id) REFERENCES content.hotel (hotelid),
  CONSTRAINT fk_panoramas_import_id FOREIGN KEY (import_id) REFERENCES content.imageimport (importid),
  CONSTRAINT panoramas_property_id_fkey FOREIGN KEY (property_id) REFERENCES content.properties (id),
  CONSTRAINT panoramas_uq_filename UNIQUE (filename)
);

CREATE INDEX fki_panoramas_property_id_fkey ON content.panoramas (property_id);
CREATE INDEX panoramas_hotel_id_idx ON content.panoramas (hotel_id);
CREATE INDEX panoramas_import_id_idx ON content.panoramas (import_id);
----------

Thanks in advance,
Ben

Re: Query planner not using indexes with JOIN query and OR clause

From
Merlin Moncure
Date:
On Mon, Jul 13, 2015 at 3:54 PM, Ben Hoyt <benhoyt@gmail.com> wrote:
> Hi folks,
>
> I have a fairly simple three-table query (pasted below) with two LEFT JOINs
> and an OR in the WHERE clause that for some reason is doing sequential scans
> on all three tables (two of them large -- several million rows), even though
> I have indexes on the relevant "filename" columns.
>
> Note the two parts of the where clause -- a filter on the image2 table and a
> filter on the panoramas table. If I comment out either filter and just
> filter on i.filename by itself, or p.filename by itself, the query planner
> uses the relevant index and the query takes a few milliseconds. But when I
> have both clauses (as shown below) it falls back to sequential scanning all
> three tables for some reason, taking several seconds.
>
> What am I missing? There must be some reason PostgreSQL can't use the index
> in this case, but I can't see what it is. If I were PostgreSQL I'd be using
> the index on i.filename and p.filename to filter to a couple of rows first,
> then join, making it super-quick.
>
> In this test I'm running PostgreSQL 9.3.3 on Windows 64-bit, but the same

FYI, this won'f fix your issue, but upgrade your postgres to the
latest bugfix release, 9.3.9.

> My query and PostgreSQL version and the explain and a lot of other table
> data is pasted below.
>
> QUERY
> ----------
> select ai.position, i.filename as image_filename, p.filename as
> panorama_filename
> from album_items ai
> left join image2 i on i.imageid = ai.image_id
> left join panoramas p on p.id = ai.panorama_id
> where i.filename in ('pano360--v471.jpg', 'pano360-2--v474.jpg') or
>       p.filename in ('pano360--v471', 'pano360-2--v474')

Try refactoring to:

select ai.position, i.filename as image_filename, p.filename as
panorama_filename
from album_items ai
left join image2 i on i.imageid = ai.image_id
left join panoramas p on p.id = ai.panorama_id
where i.filename in ('pano360--v471.jpg', 'pano360-2--v474.jpg')
union all select ai.position, i.filename as image_filename, p.filename
as panorama_filename
from album_items ai
left join image2 i on i.imageid = ai.image_id
left join panoramas p on p.id = ai.panorama_id
where  p.filename in ('pano360--v471', 'pano360-2--v474')

...and see if that helps.  Dealing with 'or' conditions is a general
weakness of the planner that has gotten better over time but in some
cases you have to boil it to 'union all'.

merlin


Re: Query planner not using indexes with JOIN query and OR clause

From
Merlin Moncure
Date:
On Mon, Jul 13, 2015 at 4:01 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Mon, Jul 13, 2015 at 3:54 PM, Ben Hoyt <benhoyt@gmail.com> wrote:
>> Hi folks,
>>
>> I have a fairly simple three-table query (pasted below) with two LEFT JOINs
>> and an OR in the WHERE clause that for some reason is doing sequential scans
>> on all three tables (two of them large -- several million rows), even though
>> I have indexes on the relevant "filename" columns.
>>
>> Note the two parts of the where clause -- a filter on the image2 table and a
>> filter on the panoramas table. If I comment out either filter and just
>> filter on i.filename by itself, or p.filename by itself, the query planner
>> uses the relevant index and the query takes a few milliseconds. But when I
>> have both clauses (as shown below) it falls back to sequential scanning all
>> three tables for some reason, taking several seconds.
>>
>> What am I missing? There must be some reason PostgreSQL can't use the index
>> in this case, but I can't see what it is. If I were PostgreSQL I'd be using
>> the index on i.filename and p.filename to filter to a couple of rows first,
>> then join, making it super-quick.
>>
>> In this test I'm running PostgreSQL 9.3.3 on Windows 64-bit, but the same
>
> FYI, this won'f fix your issue, but upgrade your postgres to the
> latest bugfix release, 9.3.9.
>
>> My query and PostgreSQL version and the explain and a lot of other table
>> data is pasted below.
>>
>> QUERY
>> ----------
>> select ai.position, i.filename as image_filename, p.filename as
>> panorama_filename
>> from album_items ai
>> left join image2 i on i.imageid = ai.image_id
>> left join panoramas p on p.id = ai.panorama_id
>> where i.filename in ('pano360--v471.jpg', 'pano360-2--v474.jpg') or
>>       p.filename in ('pano360--v471', 'pano360-2--v474')
>
> Try refactoring to:
>
> select ai.position, i.filename as image_filename, p.filename as
> panorama_filename
> from album_items ai
> left join image2 i on i.imageid = ai.image_id
> left join panoramas p on p.id = ai.panorama_id
> where i.filename in ('pano360--v471.jpg', 'pano360-2--v474.jpg')
> union all select ai.position, i.filename as image_filename, p.filename
> as panorama_filename
> from album_items ai
> left join image2 i on i.imageid = ai.image_id
> left join panoramas p on p.id = ai.panorama_id
> where  p.filename in ('pano360--v471', 'pano360-2--v474')

one thing, if a row matches both conditions you may get duplicate
rows.   If that's an issue, one possible way to fix that would be to
convert 'union all' to 'union'.  I'd have to see what the query output
and how it performed.  Another way to tackle queries like this is with
EXISTS:

select ai.position, i.filename as image_filename, p.filename as
panorama_filename
from album_items ai
where exists (
  select 1 from image2 i
  where i.imageid = ai.image_id
    and i.filename in ('pano360--v471.jpg', 'pano360-2--v474.jpg')
  union all select 1 from panoramas p
  where p.id = ai.panorama_id
    and p.filename in ('pano360--v471', 'pano360-2--v474')
)

why do this? by pushing the UNION ALL into the exists statement, you
guarantee your self at most one row from album_items without having to
resort to performance dangerous tradeoffs like DISTINCT or UNION.
Give it a try...

merlin


Re: Query planner not using indexes with JOIN query and OR clause

From
Ben Hoyt
Date:
Try refactoring to:

select ai.position, i.filename as image_filename, p.filename as
panorama_filename
from album_items ai
left join image2 i on i.imageid = ai.image_id
left join panoramas p on p.id = ai.panorama_id
where i.filename in ('pano360--v471.jpg', 'pano360-2--v474.jpg')
union all select ai.position, i.filename as image_filename, p.filename
as panorama_filename
from album_items ai
left join image2 i on i.imageid = ai.image_id
left join panoramas p on p.id = ai.panorama_id
where  p.filename in ('pano360--v471', 'pano360-2--v474')

...and see if that helps.  Dealing with 'or' conditions is a general
weakness of the planner that has gotten better over time but in some
cases you have to boil it to 'union all'.

Yes, this definitely helps and the query performance goes back to normal, thanks. It makes the code a bit more complicated, so not ideal, but definitely works!

Thanks for the help. I don't how much you know about PostgreSQL internals (I don't!), but what optimization would need to be in place for PostgreSQL to be smarter about this query?

-Ben