Thread: Re: TODO item: adding VERBOSE option to CLUSTER [with patch]

Re: TODO item: adding VERBOSE option to CLUSTER [with patch]

From
"Jim Cox"
Date:
On Thu, Oct 9, 2008 at 9:37 AM, Jim Cox <shakahshakah@gmail.com> wrote:
Is anyone working the "CLUSTER: Add VERBOSE option..." TODO item listed
on the PostgreSQL Wiki? If not, would it be wise for me to use
VERBOSE handling in an existing command (e.g. VACUUM)
as a guide while adding VERBOSE to CLUSTER?


A patch s/b attached which adds a "VERBOSE" option to the CLUSTER command as
mentioned in the following TODO item for CLUSTER: "Add VERBOSE option
to report tables as they are processed, like VACUUM VERBOSE".

In short, all three variations of the CLUSTER command now take an optional
"VERBOSE" arg, if present an INFO message is generated which displays
the schema.tblname just before actual clustering is kicked off (see example below).

postgres=# CLUSTER ;
CLUSTER

postgres=# CLUSTER VERBOSE ;
INFO:  clustering "public.my_b"
INFO:  clustering "public.my_c"
INFO:  clustering "public.my_a"
CLUSTER

postgres=# CLUSTER public.my_c ;
CLUSTER

postgres=# CLUSTER public.my_c VERBOSE ;
INFO:  clustering "public.my_c"
CLUSTER

Attachment

Re: TODO item: adding VERBOSE option to CLUSTER [with patch]

From
"Kevin Grittner"
Date:
>>> "Jim Cox" <shakahshakah@gmail.com> wrote: 
> if present an INFO message is generated which displays
> the schema.tblname just before actual clustering is kicked off (see
example
> below).
> postgres=# CLUSTER VERBOSE ;
> INFO:  clustering "public.my_b"
> INFO:  clustering "public.my_c"
> INFO:  clustering "public.my_a"
> CLUSTER
Would it make sense to display the pg_total_relation_size before and
after?
-Kevin


Re: TODO item: adding VERBOSE option to CLUSTER [with patch]

From
Heikki Linnakangas
Date:
Kevin Grittner wrote:
>>>> "Jim Cox" <shakahshakah@gmail.com> wrote: 
>> if present an INFO message is generated which displays
>> the schema.tblname just before actual clustering is kicked off (see
> example
>> below).
>  
>> postgres=# CLUSTER VERBOSE ;
>> INFO:  clustering "public.my_b"
>> INFO:  clustering "public.my_c"
>> INFO:  clustering "public.my_a"
>> CLUSTER
>  
> Would it make sense to display the pg_total_relation_size before and
> after?

Assuming you run CLUSTER as a replacement for VACUUM FULL, yes. More 
interesting would be a metric of "clusteredness", I think.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: TODO item: adding VERBOSE option to CLUSTER [with patch]

From
"Jim Cox"
Date:
On Fri, Oct 10, 2008 at 10:23 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
Kevin Grittner wrote:
"Jim Cox" <shakahshakah@gmail.com> wrote:
if present an INFO message is generated which displays
the schema.tblname just before actual clustering is kicked off (see
example
below).
 
postgres=# CLUSTER VERBOSE ;
INFO:  clustering "public.my_b"
INFO:  clustering "public.my_c"
INFO:  clustering "public.my_a"
CLUSTER
 Would it make sense to display the pg_total_relation_size before and
after?

Assuming you run CLUSTER as a replacement for VACUUM FULL, yes. More interesting would be a metric of "clusteredness", I think.

Something more like the following?

postgres=# CLUSTER VERBOSE ;
INFO:  clustering "public.my_b"
INFO:  complete, 0 rows scanned, 0 rows now live
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  clustering "public.my_c"
INFO:  complete, 20 rows scanned, 10 rows now live
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  clustering "public.my_a"
INFO:  complete, 10 rows scanned, 10 rows now live
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
CLUSTER

Re: TODO item: adding VERBOSE option to CLUSTER [with patch]

From
Heikki Linnakangas
Date:
Jim Cox wrote:
> On Fri, Oct 10, 2008 at 10:23 AM, Heikki Linnakangas <
> heikki.linnakangas@enterprisedb.com> wrote:
> 
>> Kevin Grittner wrote:
>>
>>> "Jim Cox" <shakahshakah@gmail.com> wrote:
>>>>> if present an INFO message is generated which displays
>>>> the schema.tblname just before actual clustering is kicked off (see
>>>>
>>> example
>>>
>>>> below).
>>>>
>>>
>>>> postgres=# CLUSTER VERBOSE ;
>>>> INFO:  clustering "public.my_b"
>>>> INFO:  clustering "public.my_c"
>>>> INFO:  clustering "public.my_a"
>>>> CLUSTER
>>>>
>>>  Would it make sense to display the pg_total_relation_size before and
>>> after?
>>>
>> Assuming you run CLUSTER as a replacement for VACUUM FULL, yes. More
>> interesting would be a metric of "clusteredness", I think.
>>
> 
> Something more like the following?
> 
> postgres=# CLUSTER VERBOSE ;
> INFO:  clustering "public.my_b"
> INFO:  complete, 0 rows scanned, 0 rows now live
> DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO:  clustering "public.my_c"
> INFO:  complete, 20 rows scanned, 10 rows now live
> DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO:  clustering "public.my_a"
> INFO:  complete, 10 rows scanned, 10 rows now live
> DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
> CLUSTER

No, I was thinking of something along the lines of:
INFO:  clustering "public.my_c"
INFO:  complete, was 33%, now 100% clustered

The only such measure that we have is the correlation, which isn't very 
good anyway, so I'm not sure if that's worthwhile.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: TODO item: adding VERBOSE option to CLUSTER [with patch]

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> No, I was thinking of something along the lines of:
> INFO:  clustering "public.my_c"
> INFO:  complete, was 33%, now 100% clustered
> The only such measure that we have is the correlation, which isn't very 
> good anyway, so I'm not sure if that's worthwhile.

It'd be possible to count the number of order reversals during the
indexscan, ie the number of tuples with CTID lower than the previous
one's.  But I'm not sure how useful that number really is.  Also it's
not clear how to preserve such functionality if cluster is
re-implemented with a sort.
        regards, tom lane


Re: TODO item: adding VERBOSE option to CLUSTER [with patch]

From
"Jim Cox"
Date:
On Mon, Oct 13, 2008 at 8:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> No, I was thinking of something along the lines of:
> INFO:  clustering "public.my_c"
> INFO:  complete, was 33%, now 100% clustered
> The only such measure that we have is the correlation, which isn't very
> good anyway, so I'm not sure if that's worthwhile.

It'd be possible to count the number of order reversals during the
indexscan, ie the number of tuples with CTID lower than the previous
one's.  But I'm not sure how useful that number really is.  Also it's
not clear how to preserve such functionality if cluster is
re-implemented with a sort.

                       regards, tom lane

Another version of the patch should be attached, this time counting the number of "inversions" (pairs of tuples in the table that are in the wrong order) as a measure of the "sortedness" of the original data (scanned/live numbers still reported as an indication of the extent to which the table was vacuumed).

N.B. -- I'm not familiar enough with Postgres internals to know if the included inversion_count() method is a valid way to identify inversions.

In any case, example VERBOSE output:

postgres=# CLUSTER public.my_c VERBOSE ;
INFO:  clustering "public.my_c"
INFO:  complete, 15 tuples scanned, 10 tuples now live, 2 inversions
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
CLUSTER

Attachment

Re: TODO item: adding VERBOSE option to CLUSTER [with patch]

From
Heikki Linnakangas
Date:
Jim Cox wrote:
> On Mon, Oct 13, 2008 at 8:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>>> No, I was thinking of something along the lines of:
>>> INFO:  clustering "public.my_c"
>>> INFO:  complete, was 33%, now 100% clustered
>>> The only such measure that we have is the correlation, which isn't very
>>> good anyway, so I'm not sure if that's worthwhile.
>> It'd be possible to count the number of order reversals during the
>> indexscan, ie the number of tuples with CTID lower than the previous
>> one's.  But I'm not sure how useful that number really is.  

It will look bad for patterns like:
2
1
4
3
6
5
..

which for all practical purposes is just as good as a perfectly sorted 
table. So no, I don't think that's a very useful metric either without 
somehow taking caching effects into account.

> Another version of the patch should be attached, this time counting the
> number of "inversions" (pairs of tuples in the table that are in the wrong
> order) as a measure of the "sortedness" of the original data (scanned/live
> numbers still reported as an indication of the extent to which the table was
> vacuumed).

Until we have a better metric for "sortedness", my earlier suggestion to 
print it was probably a bad idea. If anything, should probably print the 
same correlation metric that ANALYZE calculates, so that it would at 
least match what the planner uses for decision-making.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: TODO item: adding VERBOSE option to CLUSTER [with patch]

From
Jeff Davis
Date:
On Mon, 2008-10-13 at 08:30 -0400, Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> > No, I was thinking of something along the lines of:
> > INFO:  clustering "public.my_c"
> > INFO:  complete, was 33%, now 100% clustered
> > The only such measure that we have is the correlation, which isn't very 
> > good anyway, so I'm not sure if that's worthwhile.
> 
> It'd be possible to count the number of order reversals during the
> indexscan, ie the number of tuples with CTID lower than the previous
> one's.  But I'm not sure how useful that number really is.  Also it's
> not clear how to preserve such functionality if cluster is
> re-implemented with a sort.
> 

I assume here you mean a CTID with a lower page number, as the line
pointer wouldn't make any difference, right?

I think it would be a useful metric to decide whether or not to use an
index scan (I don't know how easy it is to estimate this from a sample,
but a CLUSTER could clearly get an exact number). It would solve the
problem where synchronized scans used by pg_dump could result in poor
correlation on restore and therefore not choose index scans (which is
what prompted turning off sync scans for pg_dump).

Regards,Jeff Davis





Re: TODO item: adding VERBOSE option to CLUSTER [with patch]

From
Gregory Stark
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

> Jim Cox wrote:
>> On Mon, Oct 13, 2008 at 8:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>>>
>>> It'd be possible to count the number of order reversals during the
>>> indexscan, ie the number of tuples with CTID lower than the previous
>>> one's.  But I'm not sure how useful that number really is.  

Incidentally it finally occurred to me that "sortedness" is actually a pretty
good term to search on. I found several papers for estimating metrics of
sortedness from samples even. Though the best looks like it requires a sample
of size O(sqrt(n)) which is more than we currently take.

The two metrics which seem popular is either the length of the longest
subsequence which is sorted or the number of sorted subsequences. I think the
latter is equivalent to counting the inversions.

I didn't find any papers which claimed to present good ways to draw
conclusions based on these metrics but I only did a quick search. I imagine if
everyone is looking for ways to estimate them they they must be useful for
something...

For some reason my access to the ACM digital library stopped working. Does
anyone else have access?


> It will look bad for patterns like:
> 2
> 1
> 4
> 3
> 6
> 5
> ..

Hm, you could include some measure of how far the inversion goes -- but I
think that's counter-productive. Sure some of them will be cached but others
won't and that'll be equally bad regardless of how far back it goes.

> Until we have a better metric for "sortedness", my earlier suggestion to print
> it was probably a bad idea. If anything, should probably print the same
> correlation metric that ANALYZE calculates, so that it would at least match
> what the planner uses for decision-making.

I agree with that. I like the idea of printing a message though -- we should
just have it print the correlation for now and when we improve the stats we'll
print the new metric.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production
Tuning


Re: TODO item: adding VERBOSE option to CLUSTER [with patch]

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Until we have a better metric for "sortedness", my earlier suggestion to print
>> it was probably a bad idea. If anything, should probably print the same
>> correlation metric that ANALYZE calculates, so that it would at least match
>> what the planner uses for decision-making.

> I agree with that. I like the idea of printing a message though -- we should
> just have it print the correlation for now and when we improve the stats we'll
> print the new metric.

Short of actually running an ANALYZE, I'm not seeing a good way to
derive the same number it derives.
        regards, tom lane


Re: TODO item: adding VERBOSE option to CLUSTER [with patch]

From
Gregory Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <stark@enterprisedb.com> writes:
>> I agree with that. I like the idea of printing a message though -- we should
>> just have it print the correlation for now and when we improve the stats we'll
>> print the new metric.
>
> Short of actually running an ANALYZE, I'm not seeing a good way to
> derive the same number it derives.

Well we could print the _old_ value at least. So if you run cluster
periodically you can see whether you're running it often enough.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!


Re: TODO item: adding VERBOSE option to CLUSTER [with patch]

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> Short of actually running an ANALYZE, I'm not seeing a good way to
>> derive the same number it derives.

> Well we could print the _old_ value at least.

+1 ... seems an appropriate amount of effort for the likely value.
        regards, tom lane


Re: TODO item: adding VERBOSE option to CLUSTER [with patch]

From
"Kevin Grittner"
Date:
>>> Tom Lane <tgl@sss.pgh.pa.us> wrote: 
> Gregory Stark <stark@enterprisedb.com> writes:
>> Tom Lane <tgl@sss.pgh.pa.us> writes:
>>> Short of actually running an ANALYZE, I'm not seeing a good way to
>>> derive the same number it derives.
> 
>> Well we could print the _old_ value at least.
> 
> +1 ... seems an appropriate amount of effort for the likely value.
That seems fine for sortedness, but am I the only one who would like
the verbose mode to show the bloat reduction?  Essentially, an INFO
line to show the same information you could get by bracketing the
CLUSTER with a couple SELECTs:
ccdev=# select pg_total_relation_size('"DbTranImageStatus"');pg_total_relation_size
------------------------                253952
(1 row)

ccdev=# cluster "DbTranImageStatus";
CLUSTER
ccdev=# select pg_total_relation_size('"DbTranImageStatus"');pg_total_relation_size
------------------------                 32768
(1 row)
-Kevin


Re: TODO item: adding VERBOSE option to CLUSTER [with patch]

From
Joshua Drake
Date:
On Mon, 13 Oct 2008 15:34:04 -0500
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

> ccdev=# select pg_total_relation_size('"DbTranImageStatus"');
>  pg_total_relation_size
> ------------------------
>                  253952
> (1 row)
> 
> ccdev=# cluster "DbTranImageStatus";
> CLUSTER
> ccdev=# select pg_total_relation_size('"DbTranImageStatus"');
>  pg_total_relation_size
> ------------------------
>                   32768
> (1 row)
>  
> -Kevin

Although I think it is an interesting bit of information, I find that
"if" I am going to be clustering, I have already done the above.


Sincerely,

Joshua D. Drake





-- 
The PostgreSQL Company since 1997: http://www.commandprompt.com/ 
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/




Re: TODO item: adding VERBOSE option to CLUSTER [with patch]

From
Robert Treat
Date:
On Monday 13 October 2008 16:45:35 Joshua Drake wrote:
> On Mon, 13 Oct 2008 15:34:04 -0500
>
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
> > ccdev=# select pg_total_relation_size('"DbTranImageStatus"');
> >  pg_total_relation_size
> > ------------------------
> >                  253952
> > (1 row)
> >
> > ccdev=# cluster "DbTranImageStatus";
> > CLUSTER
> > ccdev=# select pg_total_relation_size('"DbTranImageStatus"');
> >  pg_total_relation_size
> > ------------------------
> >                   32768
> > (1 row)
> >
> > -Kevin
>
> Although I think it is an interesting bit of information, I find that
> "if" I am going to be clustering, I have already done the above.
>
>

I agree with that, though I still wouldn't mind seeing some size specific 
information, similar to what vacuum verbose emits.. eg.

INFO:  "my_users_mods": removed 790 row versions in 4 pages
INFO:  "my_users_mods": found 790 removable, 308 nonremovable row versions in 
6 pages

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


Re: TODO item: adding VERBOSE option to CLUSTER [with patch]

From
Peter Eisentraut
Date:
Jim Cox wrote:
> A patch s/b attached which adds a "VERBOSE" option to the CLUSTER command as
> mentioned in the following TODO item for CLUSTER: "Add VERBOSE option
> to report tables as they are processed, like VACUUM VERBOSE".

I have committed this version of your patch, but moving the VERBOSE 
option directly after the CLUSTER word, to be more consistent with 
VACUUM and ANALYZE.  Also I added the corresponding support to the 
clusterdb command.

Additional processing information as discussed later in the thread can 
now be added easily, but I think there was no consensus on what exactly 
to print.



Re: TODO item: adding VERBOSE option to CLUSTER [with patch]

From
"Kevin Grittner"
Date:
>>> Peter Eisentraut <peter_e@gmx.net> wrote: 
> Additional processing information as discussed later in the thread
can 
> now be added easily, but I think there was no consensus on what
exactly 
> to print.
Since I use CLUSTER almost exclusively to eliminate bloat, I'd like to
see the before and after value for pg_total_relation_size for each
table.  Some didn't seem terribly interested in that, as they use the
command primarily to sequence the heap, so some measure(s) of how
out-of-order the heap was would make sense.
We are talking about an option named VERBOSE, so there's no real
reason not to throw in all useful information.
-Kevin