Re: More correlated (?) index woes - Mailing list pgsql-general

From rob stone
Subject Re: More correlated (?) index woes
Date
Msg-id 1459198864.4165.7.camel@gmail.com
Whole thread Raw
In response to More correlated (?) index woes  (Geoff Winkless <pgsqladmin@geoff.dj>)
Responses Re: More correlated (?) index woes  (Geoff Winkless <pgsqladmin@geoff.dj>)
List pgsql-general
On Mon, 2016-03-28 at 20:23 +0100, Geoff Winkless wrote:
> So I accept that when using MIN(sc_id) against scdate it makes
> statistical sense to use the sc_id index for a reasonable percentage
> of the full range of scdate, unless we know in advance that scdate is
> closely correlated to sc_id (because using MIN means we can stop
> immediately we hit a value).
>
> However I'm now finding a similar problem when using a multi-table
> DELETE, where the same obviously can't apply.
>
> This query:
>
> DELETE FROM pa USING legs WHERE legs.scdate BETWEEN 20160220 AND
> 20160222 AND legs.sc_id=pa.sc_id;
>
> does what one would hope:
>
>  Delete on pa  (cost=0.99..705406.24 rows=36570 width=12)
>    ->  Nested Loop  (cost=0.99..705406.24 rows=36570 width=12)
>          ->  Index Scan using legs_scdate_idx on legs
>  (cost=0.43..1171.88 rows=14458 width=10)
>                Index Cond: ((scdate >= 20160220) AND (scdate <=
> 20160222))
>          ->  Index Scan using pa_pkey on pa  (cost=0.56..48.33
> rows=38 width=10)
>                Index Cond: (sc_id = legs.sc_id)
>
>
> However as soon as I add an extra test for field1 IS NULL, it
> apparently goes insane:
>
>  Delete on pa  (cost=577260.90..626296.46 rows=23732 width=12)
> (actual time=41870.770..41870.770 rows=0 loops=1)
>    ->  Hash Join  (cost=577260.90..626296.46 rows=23732 width=12)
> (actual time=37886.396..41315.668 rows=44960 loops=1)
>          Hash Cond: (legs.sc_id = pa.sc_id)
>          ->  Index Scan using legs_scdate_idx on legs
>  (cost=0.43..1171.88 rows=14458 width=10) (actual time=0.030..13.667
> rows=21281 loops=1)
>                Index Cond: ((scdate >= 20160220) AND (scdate <=
> 20160222))
>          ->  Hash  (cost=481691.12..481691.12 rows=5497868 width=10)
> (actual time=37805.756..37805.756 rows=4875870 loops=1)
>                Buckets: 131072  Batches: 64  Memory Usage: 4311kB
>                ->  Seq Scan on pa  (cost=0.00..481691.12 rows=5497868
> width=10) (actual time=0.008..35869.304 rows=4875870 loops=1)
>                      Filter: (field1 IS NULL)
>                      Rows Removed by Filter: 2688634
>  Planning time: 0.447 ms
>  Execution time: 41870.832 ms
>
> Running ANALYZE makes no difference.
>
> Table pa has 7522676 rows, 4834042 of which have field1 NULL, so it's
> absolutely not reasonable to expect this to be an optimal strategy.
>
> Any suggestions as to how I can improve this query?
>
> Thanks :)
>
> Geoff



What does:-

DELETE FROM pa 
WHERE pa.field1 IS NULL
AND pa.sc_id IN (SELECT legs.sc_id FROM legs
WHERE legs.scdate BETWEEN 20160220 AND > 20160222)

give as a cost when you run ANALYZE over it?

HTH

Rob


pgsql-general by date:

Previous
From: Rob Sargent
Date:
Subject: Re: Table size for partitioned setup
Next
From: Geoff Winkless
Date:
Subject: Re: More correlated (?) index woes