Thread: More correlated (?) index woes

More correlated (?) index woes

From
Geoff Winkless
Date:
So I accept that when using MIN(sc_id) against scdate it makes statistical sense to use the sc_id index for a reasonable percentage of the full range of scdate, unless we know in advance that scdate is closely correlated to sc_id (because using MIN means we can stop as soon as we hit a value).
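
(For reference, the query in that earlier thread was something along these lines, give or take the exact table and values:

SELECT MIN(sc_id) FROM legs WHERE scdate = 20160220;

where the planner can walk the sc_id index in order and stop at the first row that satisfies the scdate condition.)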

However I'm now finding a similar problem when using a multi-table DELETE, where the same reasoning obviously can't apply.

This query:

DELETE FROM pa USING legs WHERE legs.scdate BETWEEN 20160220 AND 20160222 AND legs.sc_id=pa.sc_id;

does what one would hope:

 Delete on pa  (cost=0.99..705406.24 rows=36570 width=12)
   ->  Nested Loop  (cost=0.99..705406.24 rows=36570 width=12)
         ->  Index Scan using legs_scdate_idx on legs  (cost=0.43..1171.88 rows=14458 width=10)
               Index Cond: ((scdate >= 20160220) AND (scdate <= 20160222))
         ->  Index Scan using pa_pkey on pa  (cost=0.56..48.33 rows=38 width=10)
               Index Cond: (sc_id = legs.sc_id)


However as soon as I add an extra test for field1 IS NULL, it apparently goes insane:

 Delete on pa  (cost=577260.90..626296.46 rows=23732 width=12) (actual time=41870.770..41870.770 rows=0 loops=1)
   ->  Hash Join  (cost=577260.90..626296.46 rows=23732 width=12) (actual time=37886.396..41315.668 rows=44960 loops=1)
         Hash Cond: (legs.sc_id = pa.sc_id)
         ->  Index Scan using legs_scdate_idx on legs  (cost=0.43..1171.88 rows=14458 width=10) (actual time=0.030..13.667 rows=21281 loops=1)
               Index Cond: ((scdate >= 20160220) AND (scdate <= 20160222))
         ->  Hash  (cost=481691.12..481691.12 rows=5497868 width=10) (actual time=37805.756..37805.756 rows=4875870 loops=1)
               Buckets: 131072  Batches: 64  Memory Usage: 4311kB
               ->  Seq Scan on pa  (cost=0.00..481691.12 rows=5497868 width=10) (actual time=0.008..35869.304 rows=4875870 loops=1)
                     Filter: (field1 IS NULL)
                     Rows Removed by Filter: 2688634
 Planning time: 0.447 ms
 Execution time: 41870.832 ms
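
For reference, the statement producing that plan is just the original DELETE with the extra predicate added, i.e. something like:

DELETE FROM pa USING legs
 WHERE legs.scdate BETWEEN 20160220 AND 20160222
   AND legs.sc_id=pa.sc_id
   AND pa.field1 IS NULL;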

Running ANALYZE makes no difference.

Table pa has 7522676 rows, 4834042 of which have field1 NULL, so it's absolutely not reasonable to expect this to be an optimal strategy.

Any suggestions as to how I can improve this query?

Thanks :)

Geoff

Re: More correlated (?) index woes

From
rob stone
Date:
On Mon, 2016-03-28 at 20:23 +0100, Geoff Winkless wrote:
> So I accept that when using MIN(sc_id) against scdate it makes
> statistical sense to use the sc_id index for a reasonable percentage
> of the full range of scdate, unless we know in advance that scdate is
> closely correlated to sc_id (because using MIN means we can stop
> as soon as we hit a value).
>
> However I'm now finding a similar problem when using a multi-table
> DELETE, where the same reasoning obviously can't apply.
>
> This query:
>
> DELETE FROM pa USING legs WHERE legs.scdate BETWEEN 20160220 AND
> 20160222 AND legs.sc_id=pa.sc_id;
>
> does what one would hope:
>
>  Delete on pa  (cost=0.99..705406.24 rows=36570 width=12)
>    ->  Nested Loop  (cost=0.99..705406.24 rows=36570 width=12)
>          ->  Index Scan using legs_scdate_idx on legs
>  (cost=0.43..1171.88 rows=14458 width=10)
>                Index Cond: ((scdate >= 20160220) AND (scdate <=
> 20160222))
>          ->  Index Scan using pa_pkey on pa  (cost=0.56..48.33
> rows=38 width=10)
>                Index Cond: (sc_id = legs.sc_id)
>
>
> However as soon as I add an extra test for field1 IS NULL, it
> apparently goes insane:
>
>  Delete on pa  (cost=577260.90..626296.46 rows=23732 width=12)
> (actual time=41870.770..41870.770 rows=0 loops=1)
>    ->  Hash Join  (cost=577260.90..626296.46 rows=23732 width=12)
> (actual time=37886.396..41315.668 rows=44960 loops=1)
>          Hash Cond: (legs.sc_id = pa.sc_id)
>          ->  Index Scan using legs_scdate_idx on legs
>  (cost=0.43..1171.88 rows=14458 width=10) (actual time=0.030..13.667
> rows=21281 loops=1)
>                Index Cond: ((scdate >= 20160220) AND (scdate <=
> 20160222))
>          ->  Hash  (cost=481691.12..481691.12 rows=5497868 width=10)
> (actual time=37805.756..37805.756 rows=4875870 loops=1)
>                Buckets: 131072  Batches: 64  Memory Usage: 4311kB
>                ->  Seq Scan on pa  (cost=0.00..481691.12 rows=5497868
> width=10) (actual time=0.008..35869.304 rows=4875870 loops=1)
>                      Filter: (field1 IS NULL)
>                      Rows Removed by Filter: 2688634
>  Planning time: 0.447 ms
>  Execution time: 41870.832 ms
>
> Running ANALYZE makes no difference.
>
> Table pa has 7522676 rows, 4834042 of which have field1 NULL, so it's
> absolutely not reasonable to expect this to be an optimal strategy.
>
> Any suggestions as to how I can improve this query?
>
> Thanks :)
>
> Geoff



What does:-

DELETE FROM pa 
WHERE pa.field1 IS NULL
AND pa.sc_id IN (SELECT legs.sc_id FROM legs
WHERE legs.scdate BETWEEN 20160220 AND 20160222)

give as a cost when you run EXPLAIN ANALYZE over it?

HTH

Rob


Re: More correlated (?) index woes

From
Geoff Winkless
Date:
On 28 March 2016 at 22:01, rob stone <floriparob@gmail.com> wrote:
What does:-

DELETE FROM pa 
WHERE pa.field1 IS NULL
AND pa.sc_id IN (SELECT legs.sc_id FROM legs
WHERE legs.scdate BETWEEN 20160220 AND 20160222)

give as a cost when you run EXPLAIN ANALYZE over it?

Thanks for the suggestion.

It's approximately the same.

 Delete on pa  (cost=1463.31..493321.89 rows=187833 width=12) (actual time=41539.174..41539.174 rows=0 loops=1)
   ->  Hash Join  (cost=1463.31..493321.89 rows=187833 width=12) (actual time=41539.172..41539.172 rows=0 loops=1)
         Hash Cond: (pa.sc_id = legs.sc_id)
         ->  Seq Scan on pa  (cost=0.00..480888.83 rows=2899078 width=10) (actual time=0.010..40866.049 rows=2591264 loops=1)
               Filter: (field1 IS NULL)
               Rows Removed by Filter: 4931412
         ->  Hash  (cost=1321.48..1321.48 rows=11346 width=10) (actual time=29.481..29.481 rows=20940 loops=1)
               Buckets: 32768 (originally 16384)  Batches: 1 (originally 1)  Memory Usage: 1156kB
               ->  HashAggregate  (cost=1208.02..1321.48 rows=11346 width=10) (actual time=20.446..25.028 rows=20940 loops=1)
                     Group Key: legs.sc_id
                     ->  Index Scan using legs_scdate_idx on legs  (cost=0.43..1171.88 rows=14458 width=10) (actual time=0.025..13.133 rows=21281 loops=1)
                           Index Cond: ((scdate >= 20160220) AND (scdate <= 20160222))

I've tried creating a subquery out of the legs dataset in the hope that it would help, but that made no difference either.
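
For what it's worth, the subquery variant was roughly this shape:

DELETE FROM pa
 USING (SELECT sc_id FROM legs
         WHERE scdate BETWEEN 20160220 AND 20160222) AS l
 WHERE l.sc_id=pa.sc_id
   AND pa.field1 IS NULL;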

Geoff

Re: More correlated (?) index woes

From
Geoff Winkless
Date:
On 28 March 2016 at 20:23, I wrote:
Table pa has 7522676 rows, 4834042 of which have field1 NULL, so it's absolutely not reasonable to expect this to be an optimal strategy.
It occurred to me that even though the majority of values are NULL, there are 1691 unique values in pa.field1, so I suppose it might seem more attractive to the planner than it should do (that's more unique values than there are scdate entries).
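
(A quick way to see what the planner thinks it knows about that column, for what it's worth:

SELECT null_frac, n_distinct
  FROM pg_stats
 WHERE tablename = 'pa' AND attname = 'field1';

which should report a null_frac of roughly 0.64 and an n_distinct of roughly 1691, give or take the sample.)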

I might just set enable_seqscan to false and leave it at that. It makes me unhappy though.

Geoff


Re: More correlated (?) index woes

From
Melvin Davidson
Date:


On Tue, Mar 29, 2016 at 6:47 AM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
On 28 March 2016 at 20:23, I wrote:
Table pa has 7522676 rows, 4834042 of which have field1 NULL, so it's absolutely not reasonable to expect this to be an optimal strategy.
It occurred to me that even though the majority of values are NULL, there are 1691 unique values in pa.field1, so I suppose it might seem more attractive to the planner than it should do (that's more unique values than there are scdate entries).

I might just set enable_seqscan to false and leave it at that. It makes me unhappy though.

Geoff



>I might just set enable_seqscan to false

Geoff, that has worked for me in the past. It forces the use of an index if available, but if there is no suitable index, it will do a seq scan anyway, so there is low risk in doing that.
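
If you'd rather not change it globally, a rough sketch of scoping it to the one statement:

BEGIN;
SET LOCAL enable_seqscan = off;
DELETE FROM pa USING legs
 WHERE legs.scdate BETWEEN 20160220 AND 20160222
   AND legs.sc_id=pa.sc_id
   AND pa.field1 IS NULL;
COMMIT;

SET LOCAL reverts at the end of the transaction, so nothing else is affected.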

--
Melvin Davidson
I reserve the right to fantasize.  Whether or not you
wish to share my fantasy is entirely up to you.

Re: More correlated (?) index woes

From
bricklen
Date:

On Tue, Mar 29, 2016 at 3:47 AM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
On 28 March 2016 at 20:23, I wrote:
Table pa has 7522676 rows, 4834042 of which have field1 NULL, so it's absolutely not reasonable to expect this to be an optimal strategy.
It occurred to me that even though the majority of values are NULL, there are 1691 unique values in pa.field1, so I suppose it might seem more attractive to the planner than it should do (that's more unique values than there are scdate entries).


Perhaps a partial index like "create index pa_sc_id_pidx on pa (sc_id) with (fillfactor=100) where field1 IS NULL;" will help?

Re: More correlated (?) index woes

From
Geoff Winkless
Date:
On 31 Mar 2016 03:02, "bricklen" <bricklen@gmail.com> wrote:
> Perhaps a partial index like "create index pa_sc_id_pidx on pa (sc_id) with (fillfactor=100) where field1 IS NULL;"
> will help?

Thanks for the suggestion.

It might, but the problem with that is that there are (something like)
field2 through field16, which are used in similar queries.

I'll stick with enable_seqscan=off; it seems to be doing the trick.
Thanks though.

Geoff