Re: [HACKERS] CLUSTER command progress monitor - Mailing list pgsql-hackers

From Antonin Houska
Subject Re: [HACKERS] CLUSTER command progress monitor
Msg-id 18222.1511255651@localhost
In response to Re: [HACKERS] CLUSTER command progress monitor  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Antonin Houska <ah@cybertec.at> writes:
> > Robert Haas <robertmhaas@gmail.com> wrote:
> >> These two phases overlap, though. I believe progress reporting for
> >> sorts is really hard.
>
> > Whatever complexity is hidden in the sort, cost_sort() should have taken it
> > into consideration when called via plan_cluster_use_sort(). Thus I think that
> > once we have both startup and total cost, the current progress of the sort
> > stage can be estimated from the current number of input and output
> > rows. Please remind me if my proposal appears to be too simplistic.
>
> Well, even if you assume that the planner's cost model omits nothing
> (which I wouldn't bet on), its result is only going to be as good as the
> planner's estimate of the number of rows to be sorted.  And, in cases
> where people actually care about progress monitoring, it's likely that
> the planner got that wrong, maybe horribly so.  I think it's a bad idea
> for progress monitoring to depend on the planner's estimates in any way
> whatsoever.

The general idea was that some sort of prediction of the total cost is needed
anyway if we are to tell during execution what fraction of the work has already
been done. Also, the cost computation that we perform during execution
shouldn't (ideally) differ from cost_sort(). So I thought that it would be
easier to refine cost_sort() than to implement the same computation from
scratch elsewhere.
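
To make the proposal concrete, here is a minimal sketch (Python rather than backend C, purely for illustration) of how a progress fraction could be derived from the startup and total cost of the sort plus the current input and output row counts. The function name, its parameters, and the assumption that the startup cost is spent in proportion to rows consumed and the remainder in proportion to rows returned are all mine; nothing like this exists in the tree.

```python
def sort_progress(startup_cost, total_cost, rows_in, rows_out, est_rows):
    """Estimate the fraction of sort work done, in the range [0, 1].

    Hypothetical helper, not PostgreSQL code. Assumes (as a simplification)
    that startup_cost is spent proportionally to the rows consumed so far
    and that (total_cost - startup_cost) is spent proportionally to the
    rows returned so far.
    """
    if total_cost <= 0 or est_rows <= 0:
        return 0.0

    # Clamp so that an underestimated row count cannot push progress past 100%.
    in_frac = min(rows_in / est_rows, 1.0)
    out_frac = min(rows_out / est_rows, 1.0)

    run_cost = total_cost - startup_cost
    done = startup_cost * in_frac + run_cost * out_frac
    return done / total_cost


if __name__ == "__main__":
    # Half of the input consumed, nothing returned yet.
    print(sort_progress(80.0, 100.0, rows_in=500, rows_out=0, est_rows=1000))
```

The clamping also shows the weak spot: if est_rows is too low, the input phase sits at its maximum share while the sort is still reading tuples.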

Besides that I see 2 circumstances that make the estimate of the number of
input tuples simpler in the CLUSTER case:

* There's only one input relation, without any kind of clause.

* CLUSTER uses SnapshotAny, so pg_class.reltuples is closer to the actual
  number of input rows than it would be in the general case. (Of course,
  pg_class would only be useful for the initial estimate.)

Unlike the planner, the executor could recalculate the cost estimate at some
point(s) once it recognizes that the actual number of tuples per page appears
to differ from the density initially derived from pg_class. Still wrong?
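
A rough sketch of the recalculation I have in mind, again in Python only for illustration. The function and parameter names are hypothetical; the only inputs are the pg_class values (reltuples, relpages) and two counters the scan would maintain. It assumes the pages not yet read have the same tuple density as those already read, which is of course itself only an estimate.

```python
def estimate_total_tuples(reltuples, relpages, pages_scanned, tuples_seen):
    """Re-estimate the total number of input tuples during the heap scan.

    Hypothetical helper, not PostgreSQL code. Before any page is read, the
    estimate is pg_class.reltuples. Afterwards, the tuples already seen are
    counted exactly and the remaining pages are extrapolated using the
    density observed so far.
    """
    if relpages <= 0:
        return float(tuples_seen)
    if pages_scanned <= 0:
        return float(reltuples)  # initial estimate from pg_class

    pages_scanned = min(pages_scanned, relpages)
    observed_density = tuples_seen / pages_scanned  # tuples per page so far
    remaining_pages = relpages - pages_scanned
    return tuples_seen + observed_density * remaining_pages


if __name__ == "__main__":
    # pg_class says 10000 tuples in 100 pages, but the first 10 pages
    # already held 2000 tuples, so the estimate is revised upwards.
    print(estimate_total_tuples(10000, 100, pages_scanned=10, tuples_seen=2000))
```

The revised row count could then be fed back into the cost computation, so the reported fraction corrects itself as the scan proceeds instead of staying tied to the initial estimate.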

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

