Re: [HACKERS] CLUSTER command progress monitor - Mailing list pgsql-hackers

From	Antonin Houska
Subject	Re: [HACKERS] CLUSTER command progress monitor
Date	November 21, 2017 12:14:11
Msg-id	18222.1511255651@localhost
In response to	Re: [HACKERS] CLUSTER command progress monitor (Tom Lane <tgl@sss.pgh.pa.us>)
List	pgsql-hackers

Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Antonin Houska <ah@cybertec.at> writes:
> > Robert Haas <robertmhaas@gmail.com> wrote:
> >> These two phases overlap, though. I believe progress reporting for
> >> sorts is really hard.
>
> > Whatever complexity is hidden in the sort, cost_sort() should have taken it
> > into consideration when called via plan_cluster_use_sort(). Thus I think that
> > once we have both startup and total cost, the current progress of the sort
> > stage can be estimated from the current number of input and output
> > rows. Please remind me if my proposal appears to be too simplistic.
>
> Well, even if you assume that the planner's cost model omits nothing
> (which I wouldn't bet on), its result is only going to be as good as the
> planner's estimate of the number of rows to be sorted.  And, in cases
> where people actually care about progress monitoring, it's likely that
> the planner got that wrong, maybe horribly so.  I think it's a bad idea
> for progress monitoring to depend on the planner's estimates in any way
> whatsoever.

The general idea was that some prediction of the total cost is needed anyway
if we are to tell, during execution, what fraction of the work has already
been done. Also, the cost computation that we perform during execution
shouldn't (ideally) differ from cost_sort(). So I thought that it's easier to
refine cost_sort() than to implement the same computation from scratch
elsewhere.
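
To make the idea concrete, here is a rough sketch (in Python, for brevity) of
how a progress fraction could be derived from a startup cost, a total cost and
the current input/output row counts. The cost formula is only a simplified
stand-in for the in-memory case, not the real cost_sort(); the function names
are hypothetical, and attributing the startup cost linearly to the rows read
so far is an assumption.

```python
import math

# PostgreSQL's default cpu_operator_cost; used here only for illustration.
CPU_OPERATOR_COST = 0.0025


def sort_costs(ntuples):
    """Return (startup_cost, total_cost) for a simplified in-memory sort.

    Assumption: a stand-in model (N * log2(N) comparisons to sort, one
    operator charge per tuple returned), not the real cost_sort().
    """
    n = max(float(ntuples), 2.0)
    comparison_cost = 2.0 * CPU_OPERATOR_COST
    startup = comparison_cost * n * math.log2(n)
    total = startup + CPU_OPERATOR_COST * n
    return startup, total


def sort_progress(rows_in, rows_out, ntuples):
    """Estimate the fraction of sort work done so far (0.0 to 1.0).

    Assumption: startup cost is spread linearly over the rows read so far,
    and run cost linearly over the rows already returned.
    """
    n = max(float(ntuples), 2.0)
    startup, total = sort_costs(ntuples)
    in_frac = min(rows_in / n, 1.0)
    out_frac = min(rows_out / n, 1.0)
    done = startup * in_frac + (total - startup) * out_frac
    return done / total
```

The result is only as good as ntuples, which is exactly the objection raised
above.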

Besides that, I see 2 circumstances that make the estimate of the number of
input tuples simpler in the CLUSTER case:

* There's only 1 input relation, without any kind of clause.

* CLUSTER uses SnapshotAny, so pg_class.reltuples is closer to the actual
  number of input rows than it would be in the general case. (Of course,
  pg_class would only be useful for the initial estimate.)

Unlike the planner, the executor could recalculate the cost estimate at some
point(s) as it recognizes that the actual number of tuples per page appears to
differ from the density derived from pg_class initially. Still wrong?
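
A minimal sketch of such a recalculation (Python, for brevity): start from
pg_class.reltuples and, once some pages have been scanned, extrapolate the
observed tuples-per-page density over the remaining pages. The function name
and the plain linear extrapolation are assumptions for illustration; only
reltuples and relpages correspond to actual pg_class columns.

```python
def refine_total_tuples(reltuples, relpages, pages_scanned, tuples_seen):
    """Re-estimate the total number of input tuples during the scan.

    Before any page is scanned, fall back to the catalog value. Afterwards,
    assume the remaining pages have the same density as the pages seen.
    """
    if relpages <= 0:
        # Empty or unknown relation size: all we know is what we have seen.
        return float(tuples_seen)
    if pages_scanned <= 0:
        # Nothing observed yet: use the initial estimate from pg_class.
        return float(reltuples)
    observed_density = tuples_seen / pages_scanned
    remaining_pages = max(relpages - pages_scanned, 0)
    return tuples_seen + observed_density * remaining_pages
```

For example, with reltuples = 1000 and relpages = 10, seeing 1000 tuples in
the first 5 pages would raise the estimate to 2000.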

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at
