Thread: distinct. Is this the correct behaviour?

distinct. Is this the correct behaviour?

From

Vince Vielhaber

Date:

20 October 1999, 22:57:16

Is this the way distinct is supposed to work?  My intent is to give
only one for each different value of x - like it does in the first
distinct example.  But when order by is added for the date/time sort
I get what you see in the second distinct example.

pop4=> select * from foo;
x|y                           
-+----------------------------
1|Wed Oct 20 06:29:41 1999 EDT
1|Wed Oct 20 06:29:42 1999 EDT
1|Wed Oct 20 06:29:43 1999 EDT
1|Wed Oct 20 06:29:48 1999 EDT
(4 rows)

pop4=> select distinct x from foo;
x
-
1
(1 row)

pop4=> select distinct x from foo order by y;
x
-
1
1
1
1
(4 rows)



Vince.
-- 
==========================================================================
Vince Vielhaber -- KA8CSH   email: vev@michvhf.com   flame-mail: /dev/null # include <std/disclaimers.h>       Have you
seenhttp://www.pop4.net?       Online Campground Directory    http://www.camping-usa.com      Online Giftshop
Superstore   http://www.cloudninegifts.com
 
==========================================================================

Re: [HACKERS] distinct. Is this the correct behaviour?

From

Tom Lane

Date:

21 October 1999, 00:11:17

Vince Vielhaber <vev@michvhf.com> writes:
> Is this the way distinct is supposed to work?  My intent is to give
> only one for each different value of x - like it does in the first
> distinct example.  But when order by is added for the date/time sort
> I get what you see in the second distinct example.

Yeah, I think it's a bug too.  It's not quite clear what to change,
though.

The "problem" is that nodeUnique is doing a bitwise compare across the
whole tuple, including the hidden ('junk') y column that is needed to do
the sorting.  So, because you have four different y values, you get four
rows out.

However, if we fix nodeUnique to ignore junk columns, then the result
becomes nondeterministic.  Consider
x    y
1    31    52    4

If we do "select distinct x from foo order by y" on this data, then the
order of the result depends on which of the two tuples with x=1 happens
to get chosen by the Unique filter.  This is not good.

SQL92 gets around this by allowing ORDER BY only on columns of the
targetlist, so that you are not allowed to specify this query in the
first place.

I think it is useful to allow ORDER BY on hidden columns, but maybe we
need to forbid it when DISTINCT is present.  If we do that then the
implementation of nodeUnique is OK as it stands, and the bug is that
the parser accepts an invalid query.

This is pretty closely related to the semantic problems of DISTINCT ON,
once you see that the trouble is having columns in the query that aren't
being used for (or aren't supposed to be used for) the DISTINCT check.
        regards, tom lane

Re: [HACKERS] distinct. Is this the correct behaviour?

From

Vince Vielhaber

Date:

21 October 1999, 00:48:17

On 21-Oct-99 Tom Lane wrote:
> Vince Vielhaber <vev@michvhf.com> writes:
>
> If we do "select distinct x from foo order by y" on this data, then the
> order of the result depends on which of the two tuples with x=1 happens
> to get chosen by the Unique filter.  This is not good.

What seems logical to me tho is that it should first select all of the
x cols that are equal, put them in y order, then pick the first one for
the distinct.  At least this is the behaviour that I'm looking for; perhaps
I'm going to need to make a more complex call.  I also wonder how other
RDBMS are handling it...  I'll hafta see what sybase does (if I remember
*and* get a chance).
> SQL92 gets around this by allowing ORDER BY only on columns of the
> targetlist, so that you are not allowed to specify this query in the
> first place.

I can understand the reason, yet also fail to understand.

> I think it is useful to allow ORDER BY on hidden columns, but maybe we
> need to forbid it when DISTINCT is present.  If we do that then the
> implementation of nodeUnique is OK as it stands, and the bug is that
> the parser accepts an invalid query.
> 
> This is pretty closely related to the semantic problems of DISTINCT ON,
> once you see that the trouble is having columns in the query that aren't
> being used for (or aren't supposed to be used for) the DISTINCT check.

Ok, well what I'm trying to do is write a web-based discussion forum.  I
wanted to list the subjects in any particular forum, but also want them
to be in the order in which they were first posted.  So if I have 10 
comments on one subject which first started last month and the subject
begun with a 'z', and another that was started today with the subject
beginning with an 'A', I want the end result to be:

zebras have stripes
Always cross at the light

as opposed to 10 lines about the zebras and only one on Always.  It
seems elementary, but at the same time it seems complex.  Must mean
it's time for bed.
Vince.
-- 
==========================================================================
Vince Vielhaber -- KA8CSH   email: vev@michvhf.com   flame-mail: /dev/null # include <std/disclaimers.h>       Have you
seenhttp://www.pop4.net?       Online Campground Directory    http://www.camping-usa.com      Online Giftshop
Superstore   http://www.cloudninegifts.com
 
==========================================================================

Re: [HACKERS] distinct. Is this the correct behaviour?

From

sszabo@bigpanda.com

Date:

21 October 1999, 01:41:18

This seems to generally work in postgres for the simple cases I tried:select x from foo group by x order by min(y)

Now I don't know if there are any hidden gotchas in that (or wierdness
with the spec), but it also feels better to me than using distinct in this
case as well, because it seems to explicitly describe how you want y
ordered.

>Ok, well what I'm trying to do is write a web-based discussion forum.  I
>wanted to list the subjects in any particular forum, but also want them
>to be in the order in which they were first posted.  So if I have 10 
>comments on one subject which first started last month and the subject
>begun with a 'z', and another that was started today with the subject
>beginning with an 'A', I want the end result to be:
>
>zebras have stripes
>Always cross at the light
>
>as opposed to 10 lines about the zebras and only one on Always.  It
>seems elementary, but at the same time it seems complex.  Must mean
>it's time for bed.

Re: [HACKERS] distinct. Is this the correct behaviour?

From

Vince Vielhaber

Date:

21 October 1999, 08:53:23

On Wed, 20 Oct 1999 sszabo@bigpanda.com wrote:

> 
> This seems to generally work in postgres for the simple cases I tried:
>  select x from foo group by x order by min(y)
> 
> Now I don't know if there are any hidden gotchas in that (or wierdness
> with the spec), but it also feels better to me than using distinct in this
> case as well, because it seems to explicitly describe how you want y
> ordered.

Hey!  That works!  Thanks!  Just checked sybase and the original query
acts identical to ours.

Vince.
-- 
==========================================================================
Vince Vielhaber -- KA8CSH   email: vev@michvhf.com   flame-mail: /dev/null # include <std/disclaimers.h>       Have you
seenhttp://www.pop4.net?       Online Campground Directory    http://www.camping-usa.com      Online Giftshop
Superstore   http://www.cloudninegifts.com
 
==========================================================================

Re: [HACKERS] distinct. Is this the correct behaviour?

From

Tom Lane

Date:

21 October 1999, 12:35:25

Vince Vielhaber <vev@michvhf.com> writes:
> Just checked sybase and the original query
> acts identical to ours.

Hmph, so sybase hasn't thought through the implications of ORDER BY on
a hidden column vs. DISTINCT either.  Can anyone try it on some other
DBMSes?
        regards, tom lane

Re: [HACKERS] distinct. Is this the correct behaviour?

From

Tom Lane

Date:

21 October 1999, 12:38:26

sszabo@bigpanda.com writes:
> This seems to generally work in postgres for the simple cases I tried:
>  select x from foo group by x order by min(y)

> Now I don't know if there are any hidden gotchas in that (or wierdness
> with the spec), but it also feels better to me than using distinct in this
> case as well, because it seems to explicitly describe how you want y
> ordered.

Yes, I like that better too.

I wonder if we could/should rewrite all uses of DISTINCT into GROUP
BY under-the-hood...
        regards, tom lane