Re: Incremental Sort Cost Estimation Instability - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: Incremental Sort Cost Estimation Instability
Msg-id: 9a5fb370-6c48-400d-a78e-e57fc0ff0379@vondra.me
In response to Re: Incremental Sort Cost Estimation Instability  (David Rowley <dgrowleyml@gmail.com>)
On 9/12/24 12:12, David Rowley wrote:
> On Thu, 12 Sept 2024 at 21:51, Andrei Lepikhov <lepihov@gmail.com> wrote:
>> Initial problem causes wrong cost_sort estimation. Right now I think
>> about providing cost_sort() the sort clauses instead of (or in addition
>> to) the pathkeys.
> 
> I'm not quite sure why the sort clauses matter any more than the
> EquivalenceClass.  If the EquivalenceClass defines that all members
> will have the same value for any given row, then, if we had to choose
> any single member to drive the n_distinct estimate from, isn't the
> most accurate distinct estimate from the member with the smallest
> n_distinct estimate?  (That assumes the less distinct member has every
> value the more distinct member has, which might not be true)
> 
> David
> 

How large can the cost difference get? My assumption was it's correlated
with how different the ndistincts are for the two sides, so I tried

  CREATE TABLE t1(x integer, y integer,z text);
  CREATE TABLE t2(x integer, y integer,z text);

  INSERT INTO t1 (x,y) SELECT x, 1
    FROM generate_series(1,1000000) AS x;
  INSERT INTO t2 (x,y) SELECT mod(x,1000), 1
    FROM generate_series(1,1000000) AS x;

  CREATE INDEX ON t1(x);
  CREATE INDEX ON t2(x);
  CREATE INDEX ON t1(y);
  CREATE INDEX ON t2(y);

  VACUUM ANALYZE;

Which changes the ndistinct estimate for t2.x from 1M to 1k. I expected
the cost difference to get much larger, but it does not really change:

GroupAggregate  (cost=38.99..37886.88 rows=992 width=16) (actual rows=1 loops=1)

GroupAggregate  (cost=37874.26..37904.04 rows=992 width=16) (actual rows=1 loops=1)

So the total cost difference is tiny - I'd even say it does not matter
in practice (it seems well within the possible impact of collecting a
different random sample).
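As a side note, the rows=992 group estimate itself looks reasonable. Assuming
the query joins on t1.x = t2.x (which is what the equivalence class suggests -
the exact query may differ), David's subset assumption nearly holds for this
data, and the distinct x values surviving the join are easy to count with a
quick Python sanity check:

```python
# Quick check of David's subset assumption for the test data above:
# t2.x takes values mod(x, 1000), i.e. 0..999, while t1.x takes 1..1M,
# so the distinct x values that can survive a join on t1.x = t2.x are
# bounded by the smaller side (minus values that never match).
t1_x = set(range(1, 1_000_001))                 # 1M distinct values
t2_x = {x % 1000 for x in range(1, 1_000_001)}  # values 0..999

surviving = t1_x & t2_x   # distinct x values present on both sides
print(len(surviving))     # 999 - value 0 exists only in t2, so it drops out
```

That is within a whisker of the planner's 992, so the group-count estimate
is not the problem here.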

But the startup cost changes in a rather absurd way, while the rest of
the plan is exactly the same. We even know this:

   ->  Incremental Sort  (cost=38.99..37869.52 rows=992 width=8)
         Sort Key: t1.x, t1.y
         Presorted Key: t1.x

in both cases. There's literally no other difference between these plans
visible in the explain ...
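For anyone following along, the reason the startup cost can swing like this:
an incremental sort only has to fetch and sort the first presorted group
before emitting its first row, so the startup cost is driven by the estimated
group size (ntuples / ngroups). A toy model - made-up constants, not the real
cost_incremental_sort() code - shows the shape of that dependency:

```python
import math

def toy_incsort_startup(ntuples, ngroups, input_total_cost,
                        comparison_cost=0.005):
    """Rough shape of incremental sort startup cost: fetch the first
    presorted group from the input path and sort it.  The constants are
    made up; only the dependence on ngroups matters here."""
    group_tuples = max(ntuples / ngroups, 1.0)
    # fraction of the input we must consume to obtain the first group
    fetch = input_total_cost * (group_tuples / ntuples)
    # O(n log n) comparison cost for sorting that one group
    sort = comparison_cost * group_tuples * max(math.log2(group_tuples), 1.0)
    return fetch + sort
```

With ngroups anywhere near ntuples the startup cost is almost free; with
ngroups near 1 the "first group" is the whole input and startup approaches
the total cost - the same kind of swing as between 38.99 and 37874.26 above,
from nothing but a different ndistinct pick for the presorted key.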


I'm not sure how to fix this, but it seems estimate_num_groups() needs
to do things differently. And I agree that looking for the minimum
ndistinct seems like the right approach. But isn't estimate_num_groups()
supposed to already do that? The comment says:

 * 3.  If the list contains Vars of different relations that are known equal
 * due to equivalence classes, then drop all but one of the Vars from each
 * known-equal set, keeping the one with smallest estimated # of values
 * (since the extra values of the others can't appear in joined rows).
 * Note the reason we only consider Vars of different relations is that
 * if we considered ones of the same rel, we'd be double-counting the
 * restriction selectivity of the equality in the next step.
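A toy sketch of what that step 3 describes (made-up names and plain strings
here - the real estimate_num_groups() works on expression trees and
EquivalenceClass members, not strings):

```python
def prune_known_equal(group_vars, equal_sets, ndistinct):
    """For each known-equal set, drop all members but the one with the
    smallest ndistinct estimate (toy version of the comment's step 3).
    group_vars: grouping expressions as "rel.attr" strings;
    equal_sets: sets of vars known equal via equivalence classes;
    ndistinct: per-var distinct-value estimates."""
    dropped = set()
    for eq in equal_sets:
        members = [v for v in group_vars if v in eq]
        if len(members) > 1:
            best = min(members, key=lambda v: ndistinct[v])
            dropped.update(v for v in members if v != best)
    return [v for v in group_vars if v not in dropped]

# For the example above, this keeps t2.x (ndistinct 1000) and drops
# t1.x (ndistinct 1M), so the smaller estimate drives the result:
print(prune_known_equal(["t1.x", "t2.x", "t1.y"],
                        [{"t1.x", "t2.x"}],
                        {"t1.x": 1_000_000, "t2.x": 1000, "t1.y": 1}))
# -> ['t2.x', 't1.y']
```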

I haven't debugged this, but how come this doesn't do the trick?


regards

-- 
Tomas Vondra


