Thread: ALTER TABLE ... ALTER COLUMN ... SET DISTINCT
Per previous discussion. http://archives.postgresql.org/message-id/8066.1229106059@sss.pgh.pa.us http://archives.postgresql.org/message-id/603c8f070904021926g92eb55sdfc68141133957c1@mail.gmail.com I decided on SET DISTINCT rather than SET NDISTINCT for the DDL command because DISTINCT is already a keyword, and there didn't seem to be any compelling reason to invent a new one. ...Robert
Attachment
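For readers skimming the archive, a minimal sketch of the DDL being proposed (the table and column names are hypothetical, and the representation of the stored value is still under debate at this point in the thread):

    -- Claim that customer_id has exactly 5000 distinct values, overriding
    -- whatever estimate ANALYZE would otherwise compute.
    ALTER TABLE orders ALTER COLUMN customer_id SET DISTINCT 5000;

    -- A value of 0 removes the override, letting ANALYZE estimate again.
    ALTER TABLE orders ALTER COLUMN customer_id SET DISTINCT 0;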
Robert Haas <robertmhaas@gmail.com> writes: > Per previous discussion. > http://archives.postgresql.org/message-id/8066.1229106059@sss.pgh.pa.us > http://archives.postgresql.org/message-id/603c8f070904021926g92eb55sdfc68141133957c1@mail.gmail.com I'm not thrilled about adding a column to pg_attribute for this. Isn't there some way of keeping it in pg_statistic? regards, tom lane
On Sat, Apr 4, 2009 at 7:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Per previous discussion. >> http://archives.postgresql.org/message-id/8066.1229106059@sss.pgh.pa.us >> http://archives.postgresql.org/message-id/603c8f070904021926g92eb55sdfc68141133957c1@mail.gmail.com > > I'm not thrilled about adding a column to pg_attribute for this. > Isn't there some way of keeping it in pg_statistic? I don't like the idea of keeping it in pg_statistic. Right now, all of the data in pg_statistic is transient, so you could theoretically truncate the table at any time without losing anything permanent. It's true that we don't do that right now, but it seems cleaner to keep the data generated by the analyzer separate from the stuff we consider part of the structure of the database. If we did put the data in pg_statistic, then we'd have to teach ANALYZE that when it writes out new statistics, it also has to copy over this setting from the previous version of the tuple. And that means it would have to lock the tuples against concurrent updates while analyze is running. Also, if someone happened to run ALTER TABLE SET DISTINCT before the first run of ANALYZE on that table (for example, during pg_restore) there'd be no existing row in pg_statistic for the DDL command to update, so we'd need to create and insert a fake row (which, incidentally, would blow up any concurrent ANALYZE already in progress when it got done and tried to insert the resulting rows into pg_statistic, violating the unique constraint). All in all it seems rather messy. What is the specific nature of your concern? I thought about the possibility of a distributed performance penalty that might be associated with enlarging pg_attribute, but increasing the size of a structure that is already 112 bytes by another 4 doesn't seem likely to be significant, especially since we're not crossing a power-of-two boundary. It might be possible to reclaim 4 bytes by changing attstattarget and attndims from int4 to int2, but I'd rather do that as a separate patch. ...Robert
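Robert's transience point can be demonstrated directly: pg_statistic holds only derived data, which the next ANALYZE rebuilds from scratch. A quick sketch (requires superuser; 'foo' is a hypothetical table):

    -- Deleting a table's statistics loses nothing permanent ...
    DELETE FROM pg_statistic WHERE starelid = 'foo'::regclass;
    -- ... because ANALYZE recomputes them from a fresh sample.
    ANALYZE foo;
    SELECT staattnum, stadistinct
      FROM pg_statistic
     WHERE starelid = 'foo'::regclass;

A persistent DDL setting stored in that catalog would be the one thing such a cycle could not restore.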
Robert Haas wrote: > On Sat, Apr 4, 2009 at 7:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > >> Per previous discussion. > >> http://archives.postgresql.org/message-id/8066.1229106059@sss.pgh.pa.us > >> http://archives.postgresql.org/message-id/603c8f070904021926g92eb55sdfc68141133957c1@mail.gmail.com > > > > I'm not thrilled about adding a column to pg_attribute for this. > > Isn't there some way of keeping it in pg_statistic? > > I don't like the idea of keeping it in pg_statistic. Right now, all > of the data in pg_statistic is transient, so you could theoretically > truncate the table at any time without losing anything permanent. Maybe use a new catalog? > What is the specific nature of your concern? I thought about the > possibility of a distributed performance penalty that might be > associated with enlarging pg_attribute, but increasing the size of a > structure that is already 112 bytes by another 4 doesn't seem likely > to be significant, especially since we're not crossing a power-of-two > boundary. FWIW it has been said that whoever is concerned about pg_attribute bloat should be first looking at getting rid of the redundant entries for system columns, for each and every table. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > FWIW it has been said that whoever is concerned about pg_attribute bloat > should be first looking at getting rid of the redundant entries for > system columns, for each and every table. That's been overtaken by events, unfortunately: we now need those entries to carry per-column permissions on system columns. So they're no longer merely overhead. Possibly we could hack things so that only system columns with non-default permissions are actually stored, but that feels like a kluge. regards, tom lane
On Sat, Apr 4, 2009 at 10:31 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Robert Haas wrote: >> On Sat, Apr 4, 2009 at 7:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> > Robert Haas <robertmhaas@gmail.com> writes: >> >> Per previous discussion. >> >> http://archives.postgresql.org/message-id/8066.1229106059@sss.pgh.pa.us >> >> http://archives.postgresql.org/message-id/603c8f070904021926g92eb55sdfc68141133957c1@mail.gmail.com >> > >> > I'm not thrilled about adding a column to pg_attribute for this. >> > Isn't there some way of keeping it in pg_statistic? >> >> I don't like the idea of keeping it in pg_statistic. Right now, all >> of the data in pg_statistic is transient, so you could theoretically >> truncate the table at any time without losing anything permanent. > > Maybe use a new catalog? If we go that route, it would probably make sense to move attstattarget there as well. Obviously it wouldn't make sense to move anything that's in the critical path of ordinary database operations, but maybe attislocal or attinhcount could be moved as well. But I'm not sure it's really warranted because, AFAIK, we have no evidence that this is a real as opposed to a theoretical problem, and even if we moved all of that stuff, that's only 12 bytes, and now you have another table that's competing for space in the system cache. If someone could demonstrate (say, by reducing NAMEDATALEN) that a smaller pg_attribute structure would generate a real performance benefit, then it would be worth spending the time to figure out a way to make that happen (obviously without actually reducing NAMEDATALEN, that's only a possible way to measure the impact). >> What is the specific nature of your concern? I thought about the >> possibility of a distributed performance penalty that might be >> associated with enlarging pg_attribute, but increasing the size of a >> structure that is already 112 bytes by another 4 doesn't seem likely >> to be significant, especially since we're not crossing a power-of-two >> boundary. > > FWIW it has been said that whoever is concerned about pg_attribute bloat > should be first looking at getting rid of the redundant entries for > system columns, for each and every table. That's a different kind of bloat (more rows vs. larger rows) but a valid point all the same. I suspect neither type has much practical impact, and that if we listed all the performance problems that PostgreSQL has today, neither would be in the top 500. Bad ndistinct estimates would be, however. ...Robert
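Anyone curious about the "already 112 bytes" figure can measure the current row width on their own installation; a quick sketch (pg_column_size reports the on-disk tuple size, so it includes header overhead beyond the bare C struct):

    SELECT round(avg(pg_column_size(a.*))) AS avg_row_bytes
      FROM pg_attribute a;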
Robert Haas <robertmhaas@gmail.com> writes: > On Sat, Apr 4, 2009 at 7:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I'm not thrilled about adding a column to pg_attribute for this. > What is the specific nature of your concern? Actually, I'm more worried about the TupleDesc data structure than the catalogs. There are TupleDescs all over the backend, and I've seen evidence in profiles that setting them up is a nontrivial cost. You're very possibly right that four more bytes is in the noise, though. Two other comments now that I've read a little further: * This isn't happening for 8.4, so adjust the pg_dump code. * Using an integer is bogus. Use a float4 and forget the weird scaling; it should have exactly the same interpretation as stadistinct, except for 0 meaning "unset" instead of "unknown". regards, tom lane
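For reference, the stadistinct convention Tom is pointing at, as exposed through the pg_stats view ('foo' is a hypothetical table):

    -- n_distinct > 0 : an absolute count of distinct values
    -- n_distinct < 0 : minus a fraction of the row count, so the estimate
    --                  scales as the table grows; -1 means all values unique
    -- n_distinct = 0 : unknown (whereas 0 in the proposed column means "unset")
    SELECT attname, n_distinct
      FROM pg_stats
     WHERE tablename = 'foo';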
On Sat, Apr 4, 2009 at 11:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Sat, Apr 4, 2009 at 7:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I'm not thrilled about adding a column to pg_attribute for this. > >> What is the specific nature of your concern? > > Actually, I'm more worried about the TupleDesc data structure than > the catalogs. There are TupleDescs all over the backend, and I've > seen evidence in profiles that setting them up is a nontrivial cost. > > You're very possibly right that four more bytes is in the noise, > though. > > Two other comments now that I've read a little further: > > * This isn't happening for 8.4, so adjust the pg_dump code. I thought about writing 80500, but the effect of that would have been to render the patch impossible to test, so I didn't. :-) I think I'll be very lucky if that's the most bitrot this accumulates between now and when the tree is open for 8.5 development. System catalog changes stink in that regard. I suppose we could tag and branch the tree now, but that would just move the work of fixing any subsequent conflicts from patch authors to committers, which is sort of a zero-sum game. > * Using an integer is bogus. Use a float4 and forget the weird scaling; > it should have exactly the same interpretation as stadistinct, except > for 0 meaning "unset" instead of "unknown". I think there's a pretty good chance that will lead to a complaint that is some variant of the following: "I ran this command and then I did a pg_dump and the output doesn't match what I put in." Or maybe, "I did a dump and a restore on a different machine with a different architecture and then another dump and then I diffed them and this popped out." I have a deep-seated aversion to storing important values as float, and we seem to have no other floats anywhere in our DDL, so I was a little leery about breaking new ground. There's nothing particularly special about the scaling that the pg_statistic stuff uses, and it's basically pretty obscure internal stuff anyway, so I think the consistency argument is fairly weak. ...Robert
Robert Haas <robertmhaas@gmail.com> writes: > On Sat, Apr 4, 2009 at 11:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> * Using an integer is bogus. Use a float4 and forget the weird scaling; >> it should have exactly the same interpretation as stadistinct, except >> for 0 meaning "unset" instead of "unknown". > I have a deep-seated aversion to storing important values as float, [ shrug... ] Precision is not important for this value: we are not anywhere near needing more than six significant digits for our statistical estimates. Range, on the other hand, could be important when dealing with really large tables. So I'm much more concerned about whether the definition is too restrictive than about whether some uninformed person complains about exactness. regards, tom lane
On Sun, Apr 5, 2009 at 7:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Sat, Apr 4, 2009 at 11:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> * Using an integer is bogus. Use a float4 and forget the weird scaling; >>> it should have exactly the same interpretation as stadistinct, except >>> for 0 meaning "unset" instead of "unknown". > >> I have a deep-seated aversion to storing important values as float, > > [ shrug... ] Precision is not important for this value: we are not > anywhere near needing more than six significant digits for our > statistical estimates. Range, on the other hand, could be important > when dealing with really large tables. So I'm much more concerned > about whether the definition is too restrictive than about whether > some uninformed person complains about exactness. I thought about that, and if you think that's better, I can implement it that way. Personally, I'm unconvinced. The use case for specifying a number of distinct values in excess of 2 billion as an absolute number rather than as a percentage of the table size seems pretty weak to me. I would rather use integers and have it be clean. But I would rather have it your way than not have it at all. ...Robert
Robert Haas <robertmhaas@gmail.com> writes: > On Sun, Apr 5, 2009 at 7:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> [ shrug... ] Precision is not important for this value: we are not >> anywhere near needing more than six significant digits for our >> statistical estimates. Range, on the other hand, could be important >> when dealing with really large tables. > I thought about that, and if you think that's better, I can implement > it that way. Personally, I'm unconvinced. The use case for > specifying a number of distinct values in excess of 2 billion as an > absolute number rather than as a percentage of the table size seems > pretty weak to me. I was more concerned about the other end of it. Your patch sets a not-too-generous lower bound on the percentage that can be represented ... regards, tom lane
On Sun, Apr 5, 2009 at 10:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Sun, Apr 5, 2009 at 7:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> [ shrug... ] Precision is not important for this value: we are not >>> anywhere near needing more than six significant digits for our >>> statistical estimates. Range, on the other hand, could be important >>> when dealing with really large tables. > >> I thought about that, and if you think that's better, I can implement >> it that way. Personally, I'm unconvinced. The use case for >> specifying a number of distinct values in excess of 2 billion as an >> absolute number rather than as a percentage of the table size seems >> pretty weak to me. > > I was more concerned about the other end of it. Your patch sets a > not-too-generous lower bound on the percentage that can be represented ... Huh? With a scaling factor of 1 million, you can represent anything down to about 0.000001, which is apparently all you can expect out of a float4 anyway. http://archives.postgresql.org/pgsql-bugs/2009-01/msg00039.php In fact, we could change the scaling factor to 1 billion if you like, and it would then give you MORE significant digits than you'll get out of a float4 (and you'll be able to predict the exact number that you're gonna get). If someone has billions of rows in the table but only thousands of distinct values, I would expect them to run a script to count 'em up and specify the exact number, rather than specifying some microscopic percentage. But there's certainly enough range in int4 to tack on three more decimal places if you think it's warranted. (It's also worth pointing out that the calculations we do with ndistinct are pretty rough approximations anyway. If the difference between stadistinct = -1 x 10^-6 and stadistinct = -1.4 x 10^-6 is the thing that's determining whether the planner is picking the correct plan on your 4-billion-row table, you probably want to tune some other parameter as well so as to get further away from that line. Just getting the value in the ballpark should be a big improvement over how things stand now.) ...Robert
On Sun, Apr 5, 2009 at 10:38 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sun, Apr 5, 2009 at 10:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> On Sun, Apr 5, 2009 at 7:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>> [ shrug... ] Precision is not important for this value: we are not >>>> anywhere near needing more than six significant digits for our >>>> statistical estimates. Range, on the other hand, could be important >>>> when dealing with really large tables. >> >>> I thought about that, and if you think that's better, I can implement >>> it that way. Personally, I'm unconvinced. The use case for >>> specifying a number of distinct values in excess of 2 billion as an >>> absolute number rather than as a percentage of the table size seems >>> pretty weak to me. >> >> I was more concerned about the other end of it. Your patch sets a >> not-too-generous lower bound on the percentage that can be represented ... > > Huh? With a scaling factor of 1 million, you can represent anything > down to about 0.000001, which is apparently all you can expect out of > a float4 anyway. > > http://archives.postgresql.org/pgsql-bugs/2009-01/msg00039.php I guess I'm wrong here - 0.00001 is only one SIGNIFICANT digit. But the point remains that specifying ndistinct in ppm is probably enough for most cases, and ppb (which would still fit in int4) even more so. I don't think we need to worry about people with trillions of rows (and even they could still specify an absolute number). ...Robert
Robert Haas <robertmhaas@gmail.com> writes: > (It's also worth pointing out that the calculations we do with > ndistinct are pretty rough approximations anyway. If the difference between > stadistinct = -1 x 10^-6 and stadistinct = -1.4 x 10^-6 is the thing > that's determining whether the planner is picking the correct plan on > your 4-billion-row table, No, it's the loss of ability to set stadistinct to -1e-9 or -1e-12 or -1e-15 or so that is bothering me. In a table with billions of rows that could become important. Or maybe not; but the real bottom line here is that it is 100% silly to use a different representation in this column than is used in the underlying stadistinct column. All you accomplish by that is to impose on the user the intersection of the accuracy/range limits of the two different representations. regards, tom lane
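A worked example of the floor Tom is describing, assuming the patch's parts-per-million scaling: the smallest expressible fraction is 1 ppm, which on a very large table still implies thousands of distinct values:

    SELECT -1 / 1000000.0        AS smallest_fraction,  -- -0.000001
           4e9 * (1 / 1000000.0) AS implied_ndistinct;  -- 4000 on 4 billion rows

Anything rarer than that, such as the -1e-9 Tom mentions, could not be stated as a fraction under that scheme, only as an absolute count.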
On Sun, Apr 5, 2009 at 11:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> (It's also worth pointing out that the calculations we do with >> ndistinct are pretty rough approximations anyway. If the difference between >> stadistinct = -1 x 10^-6 and stadistinct = -1.4 x 10^-6 is the thing >> that's determining whether the planner is picking the correct plan on >> your 4-billion-row table, > > No, it's the loss of ability to set stadistinct to -1e-9 or -1e-12 or > -1e-15 or so that is bothering me. In a table with billions of rows > that could become important. > > Or maybe not; but the real bottom line here is that it is 100% silly to > use a different representation in this column than is used in the > underlying stadistinct column. All you accomplish by that is to impose > on the user the intersection of the accuracy/range limits of the two > different representations. Well, I think I was pretty clear about what I was trying to accomplish. I think there are more people who care about pg_dump output being diffable than there are who need to set ndistinct more accurately than 1 ppm and yet not as an integer. Perhaps if any of those people are reading this thread they could chime in. Otherwise, I will implement as you propose. ...Robert
* Robert Haas (robertmhaas@gmail.com) wrote: > Well, I think I was pretty clear about what I was trying to > accomplish. I think there are more people who care about pg_dump > output being diffable than there are who need to set ndistinct more > accurately than 1 ppm and yet not as an integer. Perhaps if any of > those people are reading this thread they could chime in. Otherwise, > I will implement as you propose. I do such diffs pretty often, but I don't think I've *ever* done it on catalog tables.. Perhaps it'll come up in the future, but I doubt it. Stephen
On Mon, Apr 6, 2009 at 7:30 AM, Stephen Frost <sfrost@snowman.net> wrote: > * Robert Haas (robertmhaas@gmail.com) wrote: >> Well, I think I was pretty clear about what I was trying to >> accomplish. I think there are more people who care about pg_dump >> output being diffable than there are who need to set ndistinct more >> accurately than 1 ppm and yet not as an integer. Perhaps if any of >> those people are reading this thread they could chime in. Otherwise, >> I will implement as you propose. > > I do such diffs pretty often, but I don't think I've *ever* done it on > catalog tables.. Perhaps it'll come up in the future, but I doubt it. > > Stephen Well the point is when you dump a user table, it will dump this setting along with it, same as it does now for statistics_target. So if you diff the DDL you might see differences in rounding. If you only diff the data, it won't matter unless, as you say, you're dumping pg_attribute itself. ...Robert
Robert Haas <robertmhaas@gmail.com> writes: > Well, I think I was pretty clear about what I was trying to > accomplish. I think there are more people who care about pg_dump > output being diffable than there are who need to set ndistinct more > accurately than 1 ppm and yet not as an integer. My, you *are* paranoid about float4 aren't you? Once you've stored a given value, it's always going to dump the same. Diffing different dumps isn't going to be a problem. I concede that it might look different from what you put in originally, but we make no effort to guarantee that for DDL anyway. (It is true that with pg_dump's default extra_float_digits setting, outputs from this column might look uglier than they need to. I don't see anything wrong with having pg_dump forcibly round the value to five or so digits, though.) regards, tom lane
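Tom's stability claim is easy to verify from psql: a given stored float4 always prints the same way under a given extra_float_digits setting, even when the output differs from the text originally typed in. A quick sketch:

    SET extra_float_digits = 0;
    SELECT 0.1::float4;   -- prints 0.1
    SET extra_float_digits = 3;
    SELECT 0.1::float4;   -- prints 0.100000001: uglier, but it is the same
                          -- stored value, and it is stable from dump to dump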
Robert, * Robert Haas (robertmhaas@gmail.com) wrote: > On Mon, Apr 6, 2009 at 7:30 AM, Stephen Frost <sfrost@snowman.net> wrote: > > I do such diffs pretty often, but I don't think I've *ever* done it on > > catalog tables.. Perhaps it'll come up in the future, but I doubt it. > > Well the point is when you dump a user table, it will dump this > setting along with it, same as it does now for statistics_target. So > if you diff the DDL you might see differences in rounding. If you > only diff the data, it won't matter unless, as you say, you're dumping > pg_attribute itself. The rounding when you dump it out is going to be consistent though, is it not? I mean, you might get a difference between what you try to set it to and the result that you get from pg_dump, but if you compare one pg_dump to another done later there shouldn't be any change, right? If there is an architecture difference then I could maybe see it, but I thought float-handling was well-defined on systems we run on. Thanks, Stephen
On Mon, Apr 6, 2009 at 10:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Well, I think I was pretty clear about what I was trying to >> accomplish. I think there are more people who care about pg_dump >> output being diffable than there are who need to set ndistinct more >> accurately than 1 ppm and yet not as an integer. > > My, you *are* paranoid about float4 aren't you? Yes - perhaps unreasonably so. I've had so many bad experiences with floating-point arithmetic that I've stopped trying to understand all of the crazy things that can happen and started just avoiding it like the plague. Long live numeric! > Once you've stored > a given value, it's always going to dump the same. Diffing different > dumps isn't going to be a problem. I concede that it might look > different from what you put in originally, but we make no effort to > guarantee that for DDL anyway. > > (It is true that with pg_dump's default extra_float_digits setting, > outputs from this column might look uglier than they need to. > I don't see anything wrong with having pg_dump forcibly round the > value to five or so digits, though.) So based on this comment and Stephen's remarks, I'm going to assume that I'm succumbing to a fit of unjustified paranoia and re-implement as you suggest. ...Robert
On Mon, Apr 6, 2009 at 11:15 AM, Robert Haas <robertmhaas@gmail.com> wrote: > So based on this comment and Stephen's remarks, I'm going to assume > that I'm succumbing to a fit of unjustified paranoia and re-implement > as you suggest. OK, new version of patch, this time with the weird scaling removed and the datatype changed to float4. I have not changed the minimum value for remoteVersion in pg_dump.c, as that would make the patch not able to be tested now. So that line and the comment two lines following will need to be updated prior to application. Also requires catversion bump. ...Robert
Attachment
On Sun, May 3, 2009 at 3:13 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Apr 6, 2009 at 11:15 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> So based on this comment and Stephen's remarks, I'm going to assume >> that I'm succumbing to a fit of unjustified paranoia and re-implement >> as you suggest. > > OK, new version of patch, this time with the weird scaling removed and > the datatype changed to float4. > In this paragraph I think you mean: "ALTER TABLE ... ALTER COLUMN ... SET DISTINCT"? <para> + The analyzer also estimates the number of distinct values that appear + in each column. Because only a subset of pages are scanned, this method + can sometimes be quite inaccurate, especially for large tables. If this + inaccuracy leads to bad query plans, the analyzer can be forced to use + a more correct value with <command>ALTER TABLE ... ALTER COLUMN ... SET + STATISTICS</command> (see <xref linkend="sql-altertable" + endterm="sql-altertable-title">). + </para> -- Regards, Jaime Casanova PostgreSQL support and training Systems consulting and development Guayaquil - Ecuador Cel. +59387171157
On Sun, May 3, 2009 at 11:14 PM, Jaime Casanova <jcasanov@systemguards.com.ec> wrote: > On Sun, May 3, 2009 at 3:13 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Mon, Apr 6, 2009 at 11:15 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> So based on this comment and Stephen's remarks, I'm going to assume >>> that I'm succumbing to a fit of unjustified paranoia and re-implement >>> as you suggest. >> >> OK, new version of patch, this time with the weird scaling removed and >> the datatype changed to float4. > > In this paragraph i think you mean: "ALTER TABLE ... ALTER COLUMN ... > SET DISTINCT"? Ah, yes, so I do. Corrected patch attached. > > I have not changed the minimum value for remoteVersion in pg_dump.c, > > as that would make the patch not able to be tested now. So that line > > and the comment two lines following will need to be updated prior to > > application. Also requires catversion bump. ...Robert
Attachment
On Mon, May 04, 2009 at 10:13:31PM -0400, Robert Haas wrote: <nit> > + own analysis indicates otherwie). When set to a negative value, which s/otherwie/otherwise </nit> A question: why does attdistinct become entry #5 instead of going at the end? I assume it's because the order here controls the column order, and it makes sense to have attdistinct next to attstattarget, since they're related. Is that right? Thanks in advance... > --- 185,210 ---- > * ---------------- > */ > > ! #define Natts_pg_attribute 19 > #define Anum_pg_attribute_attrelid 1 > #define Anum_pg_attribute_attname 2 > #define Anum_pg_attribute_atttypid 3 > #define Anum_pg_attribute_attstattarget 4 > ! #define Anum_pg_attribute_attdistinct 5 > ! #define Anum_pg_attribute_attlen 6 > ! #define Anum_pg_attribute_attnum 7 - Josh / eggyknap
On Mon, May 4, 2009 at 10:41 PM, Joshua Tolley <eggyknap@gmail.com> wrote: > On Mon, May 04, 2009 at 10:13:31PM -0400, Robert Haas wrote: > > <nit> >> + own analysis indicates otherwie). When set to a negative value, which > s/otherwie/otherwise > </nit> > > > A question: why does attdistinct become entry #5 instead of going at the end? > I assume it's because the order here controls the column order, and it makes > sense to have attdistinct next to attstattarget, since they're related. Is > that right? Thanks in advance... Yep, that was my thought. ...Robert
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, May 4, 2009 at 10:41 PM, Joshua Tolley <eggyknap@gmail.com> wrote: >> A question: why does attdistinct become entry #5 instead of going at the end? >> I assume it's because the order here controls the column order, and it makes >> sense to have attdistinct next to attstattarget, since they're related. Is >> that right? Thanks in advance... > Yep, that was my thought. We generally want fixed-size columns before variable-size ones, to ease accessing them from C code. So it shouldn't go at the end in any case. Beyond that it's mostly aesthetics, with maybe some thought for avoiding unnecessary alignment padding. regards, tom lane
On Tue, May 5, 2009 at 12:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, May 4, 2009 at 10:41 PM, Joshua Tolley <eggyknap@gmail.com> wrote: >>> A question: why does attdistinct become entry #5 instead of going at the end? >>> I assume it's because the order here controls the column order, and it makes >>> sense to have attdistinct next to attstattarget, since they're related. Is >>> that right? Thanks in advance... > >> Yep, that was my thought. > > We generally want fixed-size columns before variable-size ones, to ease > accessing them from C code. So it shouldn't go at the end in any case. > Beyond that it's mostly aesthetics, with maybe some thought for avoiding > unnecessary alignment padding. I thought about that as well; it should be OK where it is, in that regard. ...Robert
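The layout rule is easy to see in pg_attribute's own definition: fixed-width columns (attlen > 0) come first, and the variable-width ones (attlen = -1) cluster at the end:

    SELECT attnum, attname, attlen
      FROM pg_attribute
     WHERE attrelid = 'pg_attribute'::regclass
       AND attnum > 0
     ORDER BY attnum;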
Hi, On May 3, 2009, at 22:13, Robert Haas wrote: > OK, new version of patch, this time with the weird scaling removed and > the datatype changed to float4. You've been assigning me this patch review, so here it goes :) > I have not changed the minimum value for remoteVersion in pg_dump.c, > as that would make the patch not able to be tested now. So that line > and the comment two lines following will need to be updated prior to > application. Also requires catversion bump. I guess now would be a good time to fix this part of the patch? I couldn't apply it to current head because of bitrot. It applies fine to git commit bcaef8b5a0e2d5c143dabd8516090a09e39b27b8 [1] but from there the automatic forward merge of git isn't able to merge pg_attribute.h. As I don't have as much time to give to the review as I'd like, I'd very much welcome it if you could fix this part of your patch and I'll resume my work thereafter. I'll change the patch status to "Waiting on Author" as soon as I have this mail's message-id. Now I've had time to read the code, here are my raw notes: pg_dump.c: tbinfo->attstattarget[j] = atoi(PQgetvalue(res, j, i_attstattarget)); + tbinfo->attdistinct[j] = strdup(PQgetvalue(res, j, i_attdistinct)); ... + if (atof(tbinfo->attdistinct[j]) != 0 && + !tbinfo->attisdropped[j]) - style issue, convert at PQgetvalue() time - prefer strtod() over atof? Here's what my local man page has to say about the case: The atof() function has been deprecated by strtod() and should not be used in new code. tablecmds.c: + switch (nodeTag(newValue)) + { + case T_Integer: ... + case T_Float: ... + default: + elog(ERROR, "unrecognized node type: %d", + (int) nodeTag(newValue)); What about adding the following before the switch, to do like surrounding code? Assert(IsA(newValue, Integer) || IsA(newValue, Float)); Given your revised version I'll try to play around with ndistinct behavior and exercise dump and restore, etc, but for now I'll pause my work. I guess I'll have a second look at the code to check that it's all in the spirit of surrounding code, which I didn't complete yet (wanted to exercise my abilities to apply the patch from a past commit and forward-merge from there). Regards, -- dim [1]: http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=bcaef8b5a0e2d5c143dabd8516090a09e39b27b8
ALTER COLUMN SET DISTINCT feels like adding a unique constraint. ALTER COLUMN SET STATISTICS DISTINCT ? ALTER COLUMN SET STATISTICS NDISTINCT ? Greetings Marcin
On Fri, Jul 17, 2009 at 11:10 AM, Dimitri Fontaine<dfontaine@hi-media.com> wrote: > I couldn't apply it to current head because of bitrot. It applies fine to > git commit bcaef8b5a0e2d5c143dabd8516090a09e39b27b8 [1] but This is one of the things that I hate about the requirement to post context diffs: filterdiff, at least for me, strips out the git tags that indicate the base rev of the patch. In fact that rev is BEFORE the one I based the patch on, which was 64b2da3f7778bc6fd3d213ceb9e73ff3b6e03890, and it actually applies OK up until 0e3abe7ec78a3d51032d684598f188b0b0304fe9, the commit immediately preceding the 8.4 pgindent run. > Now I've had time to read the code, here are my raw notes: > > pg_dump.c: > tbinfo->attstattarget[j] = atoi(PQgetvalue(res, j, > i_attstattarget)); > + tbinfo->attdistinct[j] = strdup(PQgetvalue(res, j, > i_attdistinct)); > ... > + if (atof(tbinfo->attdistinct[j]) != 0 && > + !tbinfo->attisdropped[j]) > > - style issue, convert at PQgetvalue() time Ah, good catch. > - prefer strtod() over atof? Here's what my local man page has to say about > the case: > > The atof() function has been deprecated by strtod() and should not be > used in > new code. Hmm, I didn't know that (my man page doesn't contain that language). It looks like strtod() is more widely used in the existing source code than atof(), so I'll change it (atof() is however used in the floatVal() macro). > tablecmds.c: > + switch (nodeTag(newValue)) > + { > + case T_Integer: > ... > + case T_Float: > ... > + default: > + elog(ERROR, "unrecognized node type: %d", > + (int) nodeTag(newValue)); > > What about adding the following before the switch, to do like surrounding > code? > Assert(IsA(newValue, Integer) || IsA(newValue, Float)); Not a good plan. In my experience, gcc doesn't like switch () statements over enums that don't list all the values, unless they have a default clause; it masterminds by giving you a warning that you've "inadvertently" left out some values. > Given your revised version I'll try to play around with ndistinct behavior > and exercise dump and restore, etc, but for now I'll pause my work. > > I guess I'll have a second look at the code to check that it's all in the > spirit of surrounding code, which I didn't complete yet (wanted to exercise > my abilities to apply the patch from a past commit and forward-merge from > there). OK. My one concern about updating this is that Peter is currently reviewing my patch to autogenerate headers & bki stuff. If that gets committed it's going to break this all again, and I'd rather just fix it once (especially since that patch would reduce the amount of fixing needed here by 50%). Also if this gets applied after being fixed it will break that one. ...Robert
On July 18, 2009, at 20:55, Robert Haas wrote: > that indicate the base rev of the patch. In fact that rev is BEFORE > the one I based the patch on, which was > 64b2da3f7778bc6fd3d213ceb9e73ff3b6e03890, and it actually applies OK > up until 0e3abe7ec78a3d51032d684598f188b0b0304fe9, the commit > immediately preceding the 8.4 pgindent run. Robert told me on rrreviewers it's all whitespace stuff, which I could confirm after having trained my eyes to recognize the pattern and mix and match whitespace and the new column in the middle. Patch applied, I'll have to update the version check, but the following paste means I'm the one having to work on it now: Table "pg_catalog.pg_attribute" Column | Type | Modifiers ---------------+-----------+----------- ... attstattarget | integer | not null attdistinct | real | not null attlen | smallint | not null > Hmm, I didn't know that (my man page doesn't contain that language). > It looks like strtod() is more widely used in the existing source code > than atof(), so I'll change it (atof() is however used in the > floatVal() macro). If I don't have an updated version when I start again, I'll have a try at it: don't rush into it yourself :) >> What about adding the following before the switch, to do like >> surrounding >> code? >> Assert(IsA(newValue, Integer) || IsA(newValue, Float)); > > Not a good plan. In my experience, gcc doesn't like switch () > statements over enums that don't list all the values, unless they have > a default clause; it masterminds by giving you a warning that you've > "inadvertently" left out some values. Maybe this could still be a supplementary barrier for cassert builds in order to match existing code? I'm going to update the status of the patch and resume work on it next week. -- dim
On Sat, Jul 18, 2009 at 5:54 PM, Dimitri Fontaine<dfontaine@hi-media.com> wrote: > I'm going to update the status of the patch and resume work on it next week. Any update? ...Robert
On July 29, 2009, at 13:07, Robert Haas wrote: > On Sat, Jul 18, 2009 at 5:54 PM, Dimitri Fontaine<dfontaine@hi-media.com > > wrote: >> I'm going to update the status of the patch and resume work on it next >> week. > > Any update? Sorry about delays, resuming tomorrow evening is the current plan. -- dim
Hi, Finally some update on the patch. On July 18, 2009, at 20:55, Robert Haas wrote: > This is one of the things that I hate about the requirement to post > context diffs: filterdiff, at least for me, strips out the git tags > that indicate the base rev of the patch. Yes, and as I didn't have the time to install filterdiff I've attached a revision of your patch in git format which addresses the problem I mentioned, in a tarball also containing raw notes, tests, and regression.{out,diffs}. make check is failing on the opr_sanity test in what looks like an ordering issue, but I didn't feel confident enough to adapt the .out to force the regression into passing. >> - style issue, convert at PQgetvalue() time > > Ah, good catch. Fixed in the updated version, which uses a float4 variable in pg_dump and strtod() rather than atof. This last point is maybe overkill as I'm using: tbinfo->attdistinct[j] = strtod(PQgetvalue(res, j, i_attdistinct), (char **)NULL); >> What about adding the following before the switch, to do like >> surrounding >> code? >> Assert(IsA(newValue, Integer) || IsA(newValue, Float)); > > Not a good plan. In my experience, gcc doesn't like switch () > statements over enums that don't list all the values, unless they have > a default clause; it masterminds by giving you a warning that you've > "inadvertently" left out some values. I've left this part alone but still wonder about this, which is a new kind of user-visible error message: default: elog(ERROR, "unrecognized node type: %d", (int) nodeTag(newValue)); >> Given your revised version I'll try to play around with ndistinct >> behavior >> and exercise dump and restore, etc, but for now I'll pause my work. I failed to get a setting of 0 to allow ANALYZE to compute a value again, as shown in the raw notes in the attachment: dim=# alter table foo alter column x set distinct 1; ALTER TABLE ... dim=# alter table foo alter column x set distinct 0; ALTER TABLE dim=# analyze verbose foo; INFO: analyzing "public.foo" INFO: "foo": scanned 4 of 4 pages, containing 1000 live rows and 0 dead rows; 1000 rows in sample, 1000 estimated total rows ANALYZE dim=# select attname, attdistinct from pg_attribute where attrelid = 'foo'::regclass and attname = 'x'; attname | attdistinct ---------+------------- x | 0 (1 row) What I understand from the doc part of your work contradicts what I see here... Will mark as Waiting on Author, as I need Robert to tell me what to do next. Regards, -- dim
Attachment
On Fri, Jul 31, 2009 at 11:09 AM, Dimitri Fontaine<dfontaine@hi-media.com> wrote: > I failed to have 0 to allow for analyze to compute a value again, as shown > in the raw notes in attachement: I'm lost. I think you're getting the new column attdistinct mixed up with the existing column stadistinct. attdistinct is always going to have whatever value you assign it. stadistinct will get attdistinct != 0 ? attdistinct : <result of analyze calculation>. ...Robert
Hi, On August 1, 2009, at 06:08, Robert Haas wrote: > I'm lost. I think you're getting the new column attdistinct mixed up > with the existing column stadistinct. attdistinct is always going to > have whatever value you assign it. stadistinct will get attdistinct > != 0 ? attdistinct : <result of analyze calculation>. haha! Sorry about that, I felt like I was racing against time and once again I lost. dim=# alter table foo alter column x set distinct 0.25; ALTER TABLE dim=# select stadistinct from pg_statistic where starelid = 'foo'::regclass; -[ RECORD 1 ]----- stadistinct | 0.25 dim=# alter table foo alter column x set distinct 0; ALTER TABLE dim=# analyze verbose foo; INFO: analyzing "public.foo" INFO: "foo": scanned 4 of 4 pages, containing 1000 live rows and 0 dead rows; 1000 rows in sample, 1000 estimated total rows ANALYZE dim=# select stadistinct from pg_statistic where starelid = 'foo'::regclass; stadistinct ------------- -0.652 (1 row) I'm back on track, it seems. Time to further test this, but code review is ok for me (except for the strange new error already mentioned), doc is ok too, and basic behaviour is sane. I just checked pg_dump dim|psql foo and the new database (foo) has the same explicit distinct settings as the origin, both for pg_attribute and pg_statistic. Unless you want me to test for impact on planner choices (even if we already know it has impact on pg_statistic), I'd say it's ready for committer. -- dim
On Sat, Aug 1, 2009 at 7:27 AM, Dimitri Fontaine<dfontaine@hi-media.com> wrote: > Unless you want me to test for impact on planner choices (even if we already > know it has impact on pg_statistic), I'd say it's ready for committer. Great, thanks for the review. I think if stadistinct is getting populated that's sufficient in terms of testing; the problems caused by inaccurate ndistinct estimates have been discussed elsewhere. I see you've already marked this Ready for Committer, thanks. ...Robert
There wasn't any response to this comment: marcin mank <marcin.mank@gmail.com> writes: > ALTER COLUMN SET DISTINCT > feels like adding a unique constraint. > ALTER COLUMN SET STATISTICS DISTINCT ? > ALTER COLUMN SET STATISTICS NDISTINCT ? I don't want to add a new keyword, but "SET STATISTICS DISTINCT" would be an easy change. Comments? regards, tom lane
2009/8/2 Tom Lane <tgl@sss.pgh.pa.us>: > There wasn't any response to this comment: > > marcin mank <marcin.mank@gmail.com> writes: >> ALTER COLUMN SET DISTINCT >> feels like adding a unique constraint. > >> ALTER COLUMN SET STATISTICS DISTINCT ? >> ALTER COLUMN SET STATISTICS NDISTINCT ? > > I don't want to add a new keyword, but "SET STATISTICS DISTINCT" would > be an easy change. Comments? It's much better than SET DISTINCT. Pavel > > regards, tom lane
On Sun, Aug 02, 2009 at 01:23:27PM -0400, Tom Lane wrote: > There wasn't any response to this comment: > > marcin mank <marcin.mank@gmail.com> writes: > > ALTER COLUMN SET DISTINCT > > feels like adding a unique constraint. > > > ALTER COLUMN SET STATISTICS DISTINCT ? > > ALTER COLUMN SET STATISTICS NDISTINCT ? > > I don't want to add a new keyword, but "SET STATISTICS DISTINCT" would > be an easy change. Comments? +1 -- Joshua Tolley / eggyknap End Point Corporation http://www.endpoint.com
Dimitri Fontaine <dfontaine@hi-media.com> writes: > Yes, and as I didn't have the time to install filterdiff I've attached > a revision of your patch in git format which addresses the problem I > mentioned, in a tarball also containing raw notes, tests, and > regression.{out,diffs}. Applied with assorted corrections, plus the just-discussed change of syntax to SET STATISTICS DISTINCT. > make check is failing on the opr_sanity test in what looks like an > ordering issue, but I didn't feel confident enough to adapt the .out > to force the regression into passing. Hmm, I saw no such change here, so I left the regression test alone. It might be a platform-specific behavior, in which case the buildfarm will soon tell us. If so, adding an ORDER BY to that query seems like the right fix. regards, tom lane
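For the archives, the syntax as committed, with values following the stadistinct convention ('foo' is a hypothetical table):

    ALTER TABLE foo ALTER COLUMN x SET STATISTICS DISTINCT 1000;  -- exactly 1000 values
    ALTER TABLE foo ALTER COLUMN x SET STATISTICS DISTINCT -0.5;  -- half the rows are distinct
    ALTER TABLE foo ALTER COLUMN x SET STATISTICS DISTINCT 0;     -- revert to ANALYZE's estimate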
On Sun, Aug 2, 2009 at 6:17 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: > Dimitri Fontaine <dfontaine@hi-media.com> writes: >> Yes, and as I didn't have the time to install filterdiff I've attached >> a revision of your patch in git format which adresses the problem I >> mentioned, in a tarball also containing raw notes, tests, and >> regression.{out,diffs}. > > Applied with assorted corrections, plus the just-discussed change of > syntax to SET STATISTICS DISTINCT. Thanks. The changes all look good - except I'm curious why %g vs. %f? ...Robert
Robert Haas <robertmhaas@gmail.com> writes: > Thanks. The changes all look good - except I'm curious why %g vs. %f? So as not to add ".000000" unnecessarily. Positive values for ndistinct should be treated as integers. (I considered adding an error check to reject values like 20.5, but refrained...) regards, tom lane
On Sun, Aug 2, 2009 at 8:35 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Thanks. The changes all look good - except I'm curious why %g vs. %f? > > So as not to add ".000000" unnecessarily. Positive values for ndistinct > should be treated as integers. (I considered adding an error check to > reject values like 20.5, but refrained...) Oh, I see. That makes sense. I think we do entirely too much forcing things to integers in the query planner as it is. The fact that a value can't truly be fractional doesn't mean that an estimate of the value can't be fractional. Now, in this particular case, it seems hard to imagine that 20.5 is a very useful value. But that's not really our problem: we just need to reject illegal values, not useless ones. I'm interested to see how useful this proves to be in the field. I implemented it mostly on a whim because you and others seemed to think it could have some value, and because I get an unhealthy amount of personal satisfaction out of writing code during my spare time. But the real test will be to see whether the real users who were getting bad query plans as a result of poor ndistinct estimates are able to make practical use of this feature to get better ones. ...Robert
Are there plans to make a small follow-up patch to make CREATE UNIQUE INDEX on one column (and variants in CREATE TABLE ... PRIMARY KEY) automatically do SET STATISTICS DISTINCT? It might not be as perfect a solution as teaching the planner to know about unique indexes, but it is better than nothing. Good work! Regards, Kim
Kim Bisgaard <kim+pg@alleroedderne.adsl.dk> writes: > Are there plans to make a small follow-up patch to make > CREATE UNIQUE INDEX on one column > (and variants in CREATE TABLE ... PRIMARY KEY) automatically do > SET STATISTICS DISTINCT? No. It would be pointless for a single column and wrong for multiple columns. regards, tom lane