Re: postgres chooses objectively wrong index - Mailing list pgsql-performance

From Merlin Moncure
Subject Re: postgres chooses objectively wrong index
Msg-id CAHyXU0yGKAvREkT47Oiw19xhicA2=peCSYc235s1hhna==BAVQ@mail.gmail.com
In response to Re: postgres chooses objectively wrong index  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: postgres chooses objectively wrong index
List pgsql-performance
On Tue, Mar 17, 2026 at 4:16 PM Alexey Ermakov <alexius.work@gmail.com> wrote:
On 2026-03-18 03:01, Merlin Moncure wrote:
I've been maintaining an airflow style orchestrator in pl/pgsql, and it's revealed a performance issue I just can't solve.  There is a table, task, which may normally contain billions of rows, but only a tiny portion is interesting for specific reasons—a common pattern in task-type systems.

...

I'm wondering if there are other tricks that might apply here, for example, multi column index statistics...curious if anyone has thoughts on that.

Any suggestions?

merlin

Hello. I think the planner doesn't have information about the distribution of async.task_execution_state(task) unless it's part of an index. I would try to provide that with extended statistics (PostgreSQL 14+):

create statistics task_task_execution_state_stat (mcv) on ((async.task_execution_state(task))) from async.task;
analyze async.task;

If that doesn't help, please show the distribution from the pg_stats_ext view for the extended statistics object above.
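For reference, the MCV list the planner would use can be inspected with something like this (assuming the statistics object name from the CREATE STATISTICS above):

```sql
-- most_common_vals / most_common_freqs show what the planner believes
-- about the value distribution of the expression:
SELECT statistics_name, most_common_vals, most_common_freqs
FROM pg_stats_ext
WHERE statistics_name = 'task_task_execution_state_stat';
```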


This unfortunately fails, probably because the whole-row reference to the table type includes system columns (despite not using them).
 
orchestrator_service_user@orchestrator=> create statistics  task_stats (mcv) on (async.task_execution_state(task)) from async.task;
ERROR:  statistics creation on system columns is not supported

This would require some refactoring to fix. 
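One shape that refactoring could take, purely as a hedged sketch: if async.task_execution_state really only inspects a couple of ordinary columns, a variant that takes those columns directly avoids the whole-row reference entirely. The column names and logic below are invented for illustration:

```sql
-- Hypothetical: a column-based variant of the state function, so no
-- row type (and therefore no system columns) is involved:
CREATE FUNCTION async.task_execution_state_cols(status text)
RETURNS text LANGUAGE sql IMMUTABLE
AS $$ SELECT CASE WHEN status = 'complete' THEN 'done' ELSE 'live' END $$;

-- MCV extended statistics need at least two items, so pair the
-- expression with the underlying column:
CREATE STATISTICS task_stats (mcv)
    ON (async.task_execution_state_cols(status)), status
    FROM async.task;
```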

On Tue, Mar 17, 2026 at 4:24 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Merlin Moncure <mmoncure@gmail.com> writes:
> I've been maintaining an airflow style orchestrator in pl/pgsql, and it's
> revealed a performance issue I just can't solve.  There is a table, task,
> which may normally contain billions of rows, but only a tiny portion is
> interesting for specific reasons—a common pattern in task-type systems.
> ...

> Usually, we get a plan that looks like this:

>  Limit  (cost=0.38..39.74 rows=10 width=563) (actual time=0.054..0.054 rows=0 loops=1)
>    ->  Index Scan using task_concurrency_pool_priority_entered_idx on task  (cost=0.38..705.08 rows=179 width=563) (actual time=0.053..0.053 rows=0 loops=1)

> Sometimes, based on a certain data distribution, we get results like this:

> Limit  (cost=25.75..25.78 rows=10 width=563) (actual time=8.909..8.911 rows=0 loops=1)
>   ->  Sort  (cost=25.75..26.20 rows=179 width=563) (actual time=8.908..8.909 rows=0 loops=1)

I think the fundamental problem here is that the planner is estimating
179 matching rows when the true count is 0.  Getting that estimate
down by, say, an order of magnitude would probably fix your issue.
However, if the selectivity is already epsilon (are there really
billions of rows?) it may be hard to get it down to a smaller epsilon.
What statistics target are you using?

Potentially yes.  Maybe 40m in this particular database.

It's set to the default, so it isn't very precise. Is my earlier point correct, though? No distribution of data should prefer that plan (barring some low-row-count seqscan cases)? Even if the row count really were 179 rows, it would make no difference to the disparity (in fact, it'd probably be worse).
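For experimentation, the target can be raised per session, or persistently on the expression column of the partial index, assuming the index named in the plan above has an expression column:

```sql
-- Session-level, to test how a larger MCV list changes the estimate:
SET default_statistics_target = 1000;
ANALYZE async.task;

-- Or persistently, on column 1 of the expression index (PostgreSQL 11+;
-- only valid if that index column is an expression):
ALTER INDEX task_concurrency_pool_priority_entered_idx
    ALTER COLUMN 1 SET STATISTICS 1000;
ANALYZE async.task;
```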

Simplified, the query is:
SELECT * FROM foo WHERE a=? AND b=K ORDER BY c, d LIMIT N;
CREATE INDEX ON foo(a,b,c) WHERE b=K;

Why choose any other index? I was guessing an MCV stats problem, but this can be proved out without stats, IMO.
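A minimal reproduction of that shape might look like this (table, column, and constant names are made up). With a pinned by equality and b pinned by the index predicate, the partial index returns rows already ordered by c, so the LIMIT can stop after reading a handful of index entries:

```sql
CREATE TABLE foo (a int, b int, c timestamptz, d int, payload text);
CREATE INDEX foo_b1_idx ON foo (a, b, c) WHERE b = 1;

-- The partial index matches b = 1, equality on a pins the prefix,
-- and the index order on c feeds the ORDER BY ... LIMIT directly:
EXPLAIN SELECT * FROM foo
WHERE a = 42 AND b = 1
ORDER BY c, d LIMIT 10;
```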
 
How often do tasks change state?  

This is a typical FIFO task-processing system (pgmq, etc.) with a huge number of processed rows and a small number of "processing" rows that get staged and then complete. Loads are highly transient; unprocessed rows may surge into the millions before trending back to zero. This naturally puts a lot of stress on statistics. Tasks often resolve in seconds or minutes, depending on the depth of the queue.
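Given that churn, one partial mitigation is per-table autovacuum settings so the table is re-analyzed aggressively during surges; the values below are illustrative, not tuned:

```sql
-- Re-analyze after a fixed number of row changes rather than a
-- fraction of the (huge) table:
ALTER TABLE async.task SET (
    autovacuum_analyze_scale_factor = 0.0,
    autovacuum_analyze_threshold    = 10000
);
```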

Could it be reasonable to partition the task table on state, rather than rely on an index?

I've thought about this; the basic issue is that the flow module extends async.task with a BEFORE trigger. This can be worked around, but not easily. This is my drop-back-and-punt option, but I'm curious if there's an underlying fix here.
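For reference, the drop-back option might look like a list partition on a persisted state column; the schema below is invented for illustration. Updates that change state move rows between partitions, which PostgreSQL handles automatically since v11:

```sql
CREATE TABLE async.task_p (
    task_id  bigint GENERATED ALWAYS AS IDENTITY,
    state    text NOT NULL,   -- e.g. 'pending', 'running', 'complete'
    entered  timestamptz NOT NULL DEFAULT now(),
    payload  jsonb
) PARTITION BY LIST (state);

-- "Live" rows stay in a small partition that is cheap to scan,
-- analyze, and index:
CREATE TABLE async.task_live PARTITION OF async.task_p
    FOR VALUES IN ('pending', 'running');
CREATE TABLE async.task_done PARTITION OF async.task_p
    FOR VALUES IN ('complete');
```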

merlin
 
