More correlated (?) index woes - Mailing list pgsql-general

From Geoff Winkless
Subject More correlated (?) index woes
Date
Msg-id CAEzk6fex6MRdMfGiiaJyA1U=Q+f0qF=G4iESXvHtaHLb0dfRpg@mail.gmail.com
Whole thread Raw
Responses Re: More correlated (?) index woes  (rob stone <floriparob@gmail.com>)
Re: More correlated (?) index woes  (Geoff Winkless <pgsqladmin@geoff.dj>)
List pgsql-general
So I accept that when using MIN(sc_id) against scdate it makes statistical sense to use the sc_id index for a reasonable percentage of the full range of scdate, unless we know in advance that scdate is closely correlated to sc_id (because using MIN means we can stop immediately we hit a value).

However I'm now finding a similar problem when using a multi-table DELETE, where the same obviously can't apply.

This query:

DELETE FROM pa USING legs WHERE legs.scdate BETWEEN 20160220 AND 20160222 AND legs.sc_id=pa.sc_id;

does what one would hope:

 Delete on pa  (cost=0.99..705406.24 rows=36570 width=12)
   ->  Nested Loop  (cost=0.99..705406.24 rows=36570 width=12)
         ->  Index Scan using legs_scdate_idx on legs  (cost=0.43..1171.88 rows=14458 width=10)
               Index Cond: ((scdate >= 20160220) AND (scdate <= 20160222))
         ->  Index Scan using pa_pkey on pa  (cost=0.56..48.33 rows=38 width=10)
               Index Cond: (sc_id = legs.sc_id)


However as soon as I add an extra test for field1 IS NULL, it apparently goes insane:

 Delete on pa  (cost=577260.90..626296.46 rows=23732 width=12) (actual time=41870.770..41870.770 rows=0 loops=1)
   ->  Hash Join  (cost=577260.90..626296.46 rows=23732 width=12) (actual time=37886.396..41315.668 rows=44960 loops=1)
         Hash Cond: (legs.sc_id = pa.sc_id)
         ->  Index Scan using legs_scdate_idx on legs  (cost=0.43..1171.88 rows=14458 width=10) (actual time=0.030..13.667 rows=21281 loops=1)
               Index Cond: ((scdate >= 20160220) AND (scdate <= 20160222))
         ->  Hash  (cost=481691.12..481691.12 rows=5497868 width=10) (actual time=37805.756..37805.756 rows=4875870 loops=1)
               Buckets: 131072  Batches: 64  Memory Usage: 4311kB
               ->  Seq Scan on pa  (cost=0.00..481691.12 rows=5497868 width=10) (actual time=0.008..35869.304 rows=4875870 loops=1)
                     Filter: (field1 IS NULL)
                     Rows Removed by Filter: 2688634
 Planning time: 0.447 ms
 Execution time: 41870.832 ms

Running ANALYZE makes no difference.

Table pa has 7522676 rows, 4834042 of which have field1 NULL, so it's absolutely not reasonable to expect this to be an optimal strategy.

Any suggestions as to how I can improve this query?

Thanks :)

Geoff

pgsql-general by date:

Previous
From: Adrian Klaver
Date:
Subject: Re: View deleted records in a table
Next
From: Mat Arye
Date:
Subject: Table size for partitioned setup