Thread: [PERFORM]

[PERFORM]

From
Yevhenii Kurtov
Date:
Hello,

We have a query that runs almost every second, and it's very important to squeeze every last ms out of it. The query is:

SELECT c0."id" FROM "campaign_jobs" AS c0
WHERE (((c0."status" = $1) AND NOT (c0."id" = ANY($2))))
OR ((c0."status" = $3) AND (c0."failed_at" > $4))
OR ((c0."status" = $5) AND (c0."started_at" < $6))
ORDER BY c0."priority" DESC, c0."times_failed"
LIMIT $7
FOR UPDATE SKIP LOCKED

I added the following index:

CREATE INDEX ON campaign_jobs(id, status, failed_at, started_at, priority DESC, times_failed);

And it didn't help at all - quite the opposite: planning time grew from ~2 ms to ~40 ms while execution time stayed the same:

 Limit  (cost=29780.02..29781.27 rows=100 width=18) (actual time=827.753..828.113 rows=100 loops=1)
   ->  LockRows  (cost=29780.02..32279.42 rows=199952 width=18) (actual time=827.752..828.096 rows=100 loops=1)
         ->  Sort  (cost=29780.02..30279.90 rows=199952 width=18) (actual time=827.623..827.653 rows=100 loops=1)
               Sort Key: priority DESC, times_failed
               Sort Method: external sort  Disk: 5472kB
               ->  Seq Scan on campaign_jobs c0  (cost=0.00..22138.00 rows=199952 width=18) (actual time=1.072..321.410 rows=200000 loops=1)
                     Filter: (((status = 0) AND (id <> ALL ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48}'::integer[]))) OR ((status = 2) AND (failed_at > '2017-06-22 03:18:09'::timestamp without time zone)) OR ((status = 1) AND (started_at < '2017-06-23 03:11:09'::timestamp without time zone)))
 Planning time: 40.734 ms
 Execution time: 913.638 ms
(9 rows)


I see that the query still goes through a Seq Scan instead of an Index Scan. Is it due to a poorly crafted index or the query structure? Is it possible to make this query faster?


Thanks

Re: [PERFORM]

From
Pavel Stehule
Date:


2017-06-28 8:47 GMT+02:00 Yevhenii Kurtov <yevhenii.kurtov@gmail.com>:
[...]

I see that the query still goes through a Seq Scan instead of an Index Scan. Is it due to a poorly crafted index or the query structure? Is it possible to make this query faster?

There are a few issues:

a) parametrized LIMIT
b) complex predicate with a lot of ORs
c) slow external sort

b) may signal something strange in the design .. try to replace the "OR"s with a "UNION" query
c) if you can and you have enough memory .. try to increase work_mem .. maybe to 20MB

if you change the query to a union of queries, then you can use conditional (partial) indexes

create index on campaign_jobs(id) where status = 0;
create index on campaign_jobs(failed_at) where status = 2;
create index on campaign_jobs(started_at) where status = 1;

Regards

Pavel




Re: [PERFORM]

From
Yevhenii Kurtov
Date:
Hello Pavel,

Can you please give a tip on how to rewrite the query with a UNION clause? I haven't actually used it before and am afraid I won't get it right on the first try :)

On Wed, Jun 28, 2017 at 2:12 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:


[...]


Re: [PERFORM]

From
Pavel Stehule
Date:


2017-06-28 9:28 GMT+02:00 Yevhenii Kurtov <yevhenii.kurtov@gmail.com>:
Hello Pavel,

Can you please give a tip on how to rewrite the query with a UNION clause? I haven't actually used it before and am afraid I won't get it right on the first try :)

SELECT c0."id" FROM "campaign_jobs" AS c0
WHERE (((c0."status" = $1) AND NOT (c0."id" = ANY($2))))
UNION SELECT c0."id" FROM "campaign_jobs" AS c0
WHERE ((c0."status" = $3) AND (c0."failed_at" > $4))
UNION SELECT c0."id" FROM "campaign_jobs" AS c0
WHERE ((c0."status" = $5) AND (c0."started_at" < $6))
ORDER BY c0."priority" DESC, c0."times_failed"
LIMIT $7
FOR UPDATE SKIP LOCKED

Something like this

Pavel
 




Re: [PERFORM]

From
Gerardo Herzig
Date:

----- Original message -----
> From: "Yevhenii Kurtov" <yevhenii.kurtov@gmail.com>
> To: pgsql-performance@postgresql.org
> Sent: Wednesday, 28 June 2017 3:47:44
> Subject: [PERFORM]
>
> [...]
>                Sort Method: external sort  Disk: 5472kB
> [...]
> I see that the query still goes through a Seq Scan instead of an Index Scan.
> Is it due to a poorly crafted index or the query structure? Is it possible
> to make this query faster?
>
Well, most of the time is spent ordering, and it is doing a (slow) disk sort. Try increasing work_mem for an in-memory sort.
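
For example - just a sketch, and the exact value should be tuned against real data (the plan above spilled about 5.5MB to disk):

SET work_mem = '20MB';                  -- per session
-- or cluster-wide:
ALTER SYSTEM SET work_mem = '20MB';
SELECT pg_reload_conf();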

How many rows are in campaign_jobs? If the query is returning most of the rows in the table, it is not going to use any index anyway.

HTH
Gerardo


Re: [PERFORM]

From
Brad DeJong
Date:

On 2017-06-28, Pavel Stehule wrote ...
> On 2017-06-28, Yevhenii Kurtov wrote ...
>> On 2017-06-28, Pavel Stehule wrote ...
>>> On 2017-06-28, Yevhenii Kurtov wrote ...
>>>> We have a query that is run almost each second and it's very important to squeeze every other ms out of it. The query is:
 
>>>> ...
>>>> I added following index: CREATE INDEX ON campaign_jobs(id, status, failed_at, started_at, priority DESC, times_failed);
>>>> ...
>>> There are few issues 
>>> a) parametrized LIMIT
>>> b) complex predicate with lot of OR 
>>> c) slow external sort
>>>
>>> b) signalize maybe some strange in design .. try to replace "OR" by "UNION" query
>>> c) if you can and you have good enough memory .. try to increase work_mem .. maybe 20MB
>>>
>>> if you change query to union queries, then you can use conditional indexes
>>>
>>> create index(id) where status = 0;
>>> create index(failed_at) where status = 2;
>>> create index(started_at) where status = 1;
>>
>> Can you please give a tip how to rewrite the query with UNION clause?
>
> SELECT c0."id" FROM "campaign_jobs" AS c0
> WHERE (((c0."status" = $1) AND NOT (c0."id" = ANY($2))))
> UNION SELECT c0."id" FROM "campaign_jobs" AS c0
> WHERE ((c0."status" = $3) AND (c0."failed_at" > $4))
> UNION SELECT c0."id" FROM "campaign_jobs" AS c0
> WHERE ((c0."status" = $5) AND (c0."started_at" < $6))
> ORDER BY c0."priority" DESC, c0."times_failed"
> LIMIT $7
> FOR UPDATE SKIP LOCKED


Normally (at least for developers I've worked with), that kind of query structure is used when the "status" values don't overlap and don't change from query to query. Judging from Pavel's suggested conditional indexes (i.e. "where status = <constant>"), he also thinks that is likely.
 

Give the optimizer that information so that it can use it. Assuming $1 = 0 and $3 = 2 and $5 = 1, substitute literals. Substitute literal for $7 in limit. Push order by and limit to each branch of the union all (or does Postgres figure that out automatically?) Replace union with union all (not sure about Postgres, but allows other dbms to avoid sorting and merging result sets to eliminate duplicates). (Use of UNION ALL assumes that "id" is unique across rows as implied by only "id" being selected with FOR UPDATE. If multiple rows can have the same "id", then use UNION to eliminate the duplicates.)

SELECT "id" FROM "campaign_jobs" WHERE "status" = 0 AND NOT "id" = ANY($1)
  UNION ALL
SELECT "id" FROM "campaign_jobs" WHERE "status" = 2 AND "failed_at" > $2
  UNION ALL
SELECT "id" FROM "campaign_jobs" WHERE "status" = 1 AND "started_at" < $3
ORDER BY "priority" DESC, "times_failed"
LIMIT 100
FOR UPDATE SKIP LOCKED


Another thing that you could try is to push the ORDER BY and LIMIT to the branches of the UNION (or does Postgres figure that out automatically?) and use slightly different indexes. This may not make sense for all the branches but one nice thing about UNION is that each branch can be tweaked independently. Also, there are probably unmentioned functional dependencies that you can use to reduce the index size and/or improve your match rate. Example - if status = 1 means that the campaign_job has started but not failed or completed, then you may know that started_at is set, but failed_at and ended_at are null. The < comparison in and of itself implies that only rows where "started_at" is not null will match the condition.
 

SELECT c0."id" FROM "campaign_jobs" AS c0 WHERE (((c0."status" = 0) AND NOT (c0."id" = ANY($1)))) ORDER BY c0."priority" DESC, c0."times_failed" LIMIT 100
 
UNION ALL
SELECT c0."id" FROM "campaign_jobs" AS c0 WHERE ((c0."status" = 2) AND (c0."failed_at" > $2)) ORDER BY c0."priority" DESC, c0."times_failed" LIMIT 100
 
UNION ALL
SELECT c0."id" FROM "campaign_jobs" AS c0 WHERE ((c0."status" = 1) AND (c0."started_at" < $3)) ORDER BY c0."priority" DESC, c0."times_failed" LIMIT 100
 
ORDER BY c0."priority" DESC, c0."times_failed"
LIMIT 100
FOR UPDATE SKIP LOCKED

Including the "priority", "times_failed" and "id" columns in the indexes along with "failed_at"/"started_at" allows the optimizer to do index only scans. (May still have to do random I/O to the data page to determine tuple version visibility but I don't think that can be eliminated.)
 

create index ... ("priority" desc, "times_failed", "id")               where "status" = 0;
create index ... ("priority" desc, "times_failed", "id", "failed_at")  where "status" = 2 and "failed_at" is not null;
create index ... ("priority" desc, "times_failed", "id", "started_at") where "status" = 1 and "started_at" is not null; -- and ended_at is null and ...
 


I'm assuming that the optimizer knows that "where status = 1 and started_at < $3" implies "and started_at is not null" and will consider the conditional index. If not, then the "and started_at is not null" needs to be explicit.
 

Re: [PERFORM]

From
Yevhenii Kurtov
Date:
Hello folks,

Thank you very much for the analysis and suggestions - there is a lot to learn here. I just tried the UNION queries and got the following error:

ERROR:  FOR UPDATE is not allowed with UNION/INTERSECT/EXCEPT

I made a table dump for anyone who wants to give it a spin https://app.box.com/s/464b12glmlk5o4gvzz7krc4c8s2fxlwr
and here is the gist for the original commands https://gist.github.com/lessless/33215d0c147645db721e74e07498ac53

On Wed, Jun 28, 2017 at 8:10 PM, Brad DeJong <Brad.Dejong@infor.com> wrote:


[...]

Re: [PERFORM]

From
Pavel Stehule
Date:


2017-06-29 7:17 GMT+02:00 Yevhenii Kurtov <yevhenii.kurtov@gmail.com>:
Hello folks,

Thank you very much for the analysis and suggestions - there is a lot to learn here. I just tried the UNION queries and got the following error:

ERROR:  FOR UPDATE is not allowed with UNION/INTERSECT/EXCEPT

it is sad :(

maybe a bitmap index scan can work

postgres=# create table test(id int, started date, failed date, status int);
CREATE TABLE
postgres=# create index on test(id) where status = 0;
CREATE INDEX
postgres=# create index on test(started) where status = 1;
CREATE INDEX
postgres=# create index on test(failed ) where status = 2;
CREATE INDEX
postgres=# explain select id from test where (status = 0 and id in (1,2,3,4,5)) or (status = 1 and started < current_date) or (status = 2 and failed > current_date);
                                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=12.93..22.50 rows=6 width=4)
   Recheck Cond: (((id = ANY ('{1,2,3,4,5}'::integer[])) AND (status = 0)) OR ((started < CURRENT_DATE) AND (status = 1)) OR ((failed > CURRENT_DATE) AND (status = 2)))
   Filter: (((status = 0) AND (id = ANY ('{1,2,3,4,5}'::integer[]))) OR ((status = 1) AND (started < CURRENT_DATE)) OR ((status = 2) AND (failed > CURRENT_DATE)))
   ->  BitmapOr  (cost=12.93..12.93 rows=6 width=0)
         ->  Bitmap Index Scan on test_id_idx  (cost=0.00..4.66 rows=1 width=0)
               Index Cond: (id = ANY ('{1,2,3,4,5}'::integer[]))
         ->  Bitmap Index Scan on test_started_idx  (cost=0.00..4.13 rows=3 width=0)
               Index Cond: (started < CURRENT_DATE)
         ->  Bitmap Index Scan on test_failed_idx  (cost=0.00..4.13 rows=3 width=0)
               Index Cond: (failed > CURRENT_DATE)
(10 rows)
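
Another possible way around the FOR UPDATE restriction - just a sketch, not tested here - is to keep the UNION ALL in a sub-select and take the row locks only in the outer query; the restriction applies only when the set operation and FOR UPDATE sit at the same query level:

SELECT c0."id" FROM "campaign_jobs" AS c0
WHERE c0."id" IN (
    SELECT "id" FROM "campaign_jobs" WHERE "status" = 0 AND NOT "id" = ANY($1)
    UNION ALL
    SELECT "id" FROM "campaign_jobs" WHERE "status" = 2 AND "failed_at" > $2
    UNION ALL
    SELECT "id" FROM "campaign_jobs" WHERE "status" = 1 AND "started_at" < $3
)
ORDER BY c0."priority" DESC, c0."times_failed"
LIMIT 100
FOR UPDATE SKIP LOCKED;

The conditional indexes suggested earlier could then still serve the branches of the inner UNION ALL.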

  



Re: [PERFORM]

From
Jeff Janes
Date:
On Tue, Jun 27, 2017 at 11:47 PM, Yevhenii Kurtov <yevhenii.kurtov@gmail.com> wrote:
[...]

I see that query still went through the Seq Scan instead of Index Scan. Is it due to poorly crafted index or because of query structure? Is it possible to make this query faster?

An index on (priority desc, times_failed) should speed this up massively.  Might want to include status at the end as well. However, your example data is not terribly realistic.
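
Roughly, that would be something like:

CREATE INDEX ON campaign_jobs (priority DESC, times_failed, status);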

What version of PostgreSQL are you using?

Cheers,

Jeff

Re: [PERFORM]

From
Yevhenii Kurtov
Date:
Hi Jeff,

That is just sample data - we are going live in Jun and I don't have anything real so far. Right now it's 9.6, and it will be the latest stable release available on the date that we go live.

On Fri, Jun 30, 2017 at 1:11 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
[...]

Re: [PERFORM]

From
Alvaro Herrera
Date:
Pavel Stehule wrote:
> 2017-06-29 7:17 GMT+02:00 Yevhenii Kurtov <yevhenii.kurtov@gmail.com>:
>

> > I just  tried UNION queries and got following error:
> >
> > ERROR:  FOR UPDATE is not allowed with UNION/INTERSECT/EXCEPT
>
> it is sad :(

I think we could lift this restriction for UNION ALL, but UNION sounds
difficult.


BTW I wonder how much of the original problem is caused by using a
prepared query.  I understand the desire to avoid repeated planning
work, but I think in this case it may be working against you.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PERFORM]

From
Scott Marlowe
Date:
On Thu, Jun 29, 2017 at 1:11 PM, Yevhenii Kurtov
<yevhenii.kurtov@gmail.com> wrote:
> Hi Jeff,
>
> That is just a sample data, we are going live in Jun and I don't have
> anything real so far. Right now it's 9.6 and it will be a latest stable
> available release on the date that we go live.

Trust me on this one, you want to get some realistic fake data in
there, and in realistic quantities, before you go live to test.
PostgreSQL's planner makes decisions based on the size of the data it has
to trundle through and statistical analysis of the data in the tables
etc. You don't wanna go from 500 rows in a test table with the same
values to 10,000,000 rows with wildly varying data in production
without having some clue where that db is gonna be headed performance-wise.


Re: [PERFORM]

From
Jeff Janes
Date:
On Thu, Jun 29, 2017 at 12:11 PM, Yevhenii Kurtov <yevhenii.kurtov@gmail.com> wrote:
Hi Jeff,

That is just a sample data, we are going live in Jun and I don't have anything real so far. Right now it's 9.6 and it will be a latest stable available release on the date that we go live. 


You need to use your knowledge of the application to come up with some plausible sample data.

What happens when something succeeds?  Does it get deleted from the table, or does it get retained but with a certain value of the status column?  If it is retained, what happens to the priority and times_failed fields?

The performance of your queuing table will critically depend on that.  

If you need to keep it once it succeeds, you should probably do that by deleting it from the queuing table and inserting it into a history table.  It is much easier to keep performance up with that kind of design.
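
For example, roughly - where campaign_jobs_history is a hypothetical archive table with the same columns:

WITH done AS (
    DELETE FROM campaign_jobs
    WHERE id = $1            -- the job that just finished successfully
    RETURNING *
)
INSERT INTO campaign_jobs_history
SELECT * FROM done;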

Cheers,

Jeff