Re: [PERFORM] - Mailing list pgsql-performance

From Brad DeJong
Subject Re: [PERFORM]
Date
Msg-id CY1PR0201MB189792E39DD5C36B59EEC362FFDD0@CY1PR0201MB1897.namprd02.prod.outlook.com
Whole thread Raw
In response to Re: [PERFORM]  (Pavel Stehule <pavel.stehule@gmail.com>)
Responses Re: [PERFORM]  (Yevhenii Kurtov <yevhenii.kurtov@gmail.com>)
List pgsql-performance

On 2017-06-28, Pavel Stehule wrote ...
> On 2017-06-28, Yevhenii Kurtov wrote ...
>> On 2017-06-28, Pavel Stehule wrote ...
>>> On 2017-06-28, Yevhenii Kurtov wrote ...
>>>> We have a query that is run almost each second and it's very important to squeeze every other ms out of it. The
queryis:
 
>>>> ...
>>>> I added following index: CREATE INDEX ON campaign_jobs(id, status, failed_at, started_at, priority DESC,
times_failed);
>>>> ...
>>> There are few issues 
>>> a) parametrized LIMIT
>>> b) complex predicate with lot of OR 
>>> c) slow external sort
>>>
>>> b) signalize maybe some strange in design .. try to replace "OR" by "UNION" query
>>> c) if you can and you have good enough memory .. try to increase work_mem .. maybe 20MB
>>>
>>> if you change query to union queries, then you can use conditional indexes
>>>
>>> create index(id) where status = 0;
>>> create index(failed_at) where status = 2;
>>> create index(started_at) where status = 1;
>>
>> Can you please give a tip how to rewrite the query with UNION clause?
>
> SELECT c0."id" FROM "campaign_jobs" AS c0
> WHERE (((c0."status" = $1) AND NOT (c0."id" = ANY($2))))
> UNION SELECT c0."id" FROM "campaign_jobs" AS c0
> WHERE ((c0."status" = $3) AND (c0."failed_at" > $4))
> UNION SELECT c0."id" FROM "campaign_jobs" AS c0
> WHERE ((c0."status" = $5) AND (c0."started_at" < $6))
> ORDER BY c0."priority" DESC, c0."times_failed"
> LIMIT $7
> FOR UPDATE SKIP LOCKED


Normally (at least for developers I've worked with), that kind of query structure is used when the "status" values
don'toverlap and don't change from query to query. Judging from Pavel's suggested conditional indexes (i.e. "where
status= <constant>"), he also thinks that is likely.
 

Give the optimizer that information so that it can use it. Assuming $1 = 0 and $3 = 2 and $5 = 1, substitute literals.
Substituteliteral for $7 in limit. Push order by and limit to each branch of the union all (or does Postgres figure
thatout automatically?) Replace union with union all (not sure about Postgres, but allows other dbms to avoid sorting
andmerging result sets to eliminate duplicates). (Use of UNION ALL assumes that "id" is unique across rows as implied
byonly "id" being selected with FOR UPDATE. If multiple rows can have the same "id", then use UNION to eliminate the
duplicates.)

SELECT "id" FROM "campaign_jobs" WHERE "status" = 0 AND NOT "id" = ANY($1)
  UNION ALL
SELECT "id" FROM "campaign_jobs" WHERE "status" = 2 AND "failed_at" > $2
  UNION ALL
SELECT "id" FROM "campaign_jobs" WHERE "status" = 1 AND "started_at" < $3
ORDER BY "priority" DESC, "times_failed"
LIMIT 100
FOR UPDATE SKIP LOCKED


Another thing that you could try is to push the ORDER BY and LIMIT to the branches of the UNION (or does Postgres
figurethat out automatically?) and use slightly different indexes. This may not make sense for all the branches but one
nicething about UNION is that each branch can be tweaked independently. Also, there are probably unmentioned functional
dependenciesthat you can use to reduce the index size and/or improve your match rate. Example - if status = 1 means
thatthe campaign_job has started but not failed or completed, then you may know that started_at is set, but failed_at
andended_at are null. The < comparison in and of itself implies that only rows where "started_at" is not null will
matchthe condition.
 

SELECT c0."id" FROM "campaign_jobs" AS c0 WHERE (((c0."status" = 0) AND NOT (c0."id" = ANY($1)))) ORDER BY
c0."priority"DESC, c0."times_failed" LIMIT 100
 
UNION ALL
SELECT c0."id" FROM "campaign_jobs" AS c0 WHERE ((c0."status" = 2) AND (c0."failed_at" > $2)) ORDER BY c0."priority"
DESC,c0."times_failed" LIMIT 100
 
UNION ALL
SELECT c0."id" FROM "campaign_jobs" AS c0 WHERE ((c0."status" = 1) AND (c0."started_at" < $3)) ORDER BY c0."priority"
DESC,c0."times_failed" LIMIT 100
 
ORDER BY c0."priority" DESC, c0."times_failed"
LIMIT 100
FOR UPDATE SKIP LOCKED

Including the "priority", "times_failed" and "id" columns in the indexes along with "failed_at"/"started_at" allows the
optimizerto do index only scans. (May still have to do random I/O to the data page to determine tuple version
visibilitybut I don't think that can be eliminated.)
 

create index ... ("priority" desc, "times_failed", "id")               where "status" = 0;
create index ... ("priority" desc, "times_failed", "id", "failed_at")  where "status" = 2 and "failed_at" is not null;
create index ... ("priority" desc, "times_failed", "id", "started_at") where "status" = 1 and "started_at" is not null;
--and ended_at is null and ...
 


I'm assuming that the optimizer knows that "where status = 1 and started_at < $3" implies "and started_at is not null"
andwill consider the conditional index. If not, then the "and started_at is not null" needs to be explicit.
 

pgsql-performance by date:

Previous
From: Gerardo Herzig
Date:
Subject: Re: [PERFORM]
Next
From: Merlin Moncure
Date:
Subject: Re: [PERFORM] Efficiently merging and sorting collections of sorted rows