Thread: Re: [HACKERS] Counting bool flags in a complex query

Re: [HACKERS] Counting bool flags in a complex query

From
Michael Richards
Date:
On Thu, 15 Jul 1999, Tom Lane wrote:

> Michael Richards <miker@scifair.acadiau.ca> writes:
> > I'm not sure this is correct, but I think I see a bug of some sort...
> 
> I committed a fix last night; it will be in 6.5.1.

I've found what I believe is another set of bugs:
This is my monster query again...

My folder numbers are: negative numbers are system folders such as New
mail, trash, drafts and sentmail. I wanted to order the tuples so that the
folderids were sorted from -1 to -4, then 1 to x. This way the system
folders would always appear first in the list.

This may not be valid SQL, as none of my books mention it. Is it possible
to order by an expression?

Here are some examples which some some odd behaviour. My suspected bug
findings are at the end:

SELECT folderid,foldername,count(*) as "messgaes",sum(bool2int(flagnew))
as "newmessages",sum(contentlength) as "size" FROM usermail,folders WHERE
usermail.loginid='michael' and folders.loginid=usermail.loginid AND
usermail.folder = folders.folderid GROUP BY folderid,foldername UNION
SELECT folderid,foldername,0,0,0 FROM folders WHERE loginid='michael' AND
NOT EXISTS (SELECT folder FROM usermail WHERE loginid='michael' AND
folder=folderid) order by (folderid>0);
folderid|foldername      |messgaes|newmessages|   size
--------+----------------+--------+-----------+-------     -4|Deleted Messages|     110|         50| 245627     -2|Sent
Mail      |       7|          2|  10878     -1|New Mail Folder |      73|          1|8831226      1|OOL             |
   7|          0|   8470      2|suggestions     |      26|          0|  35433      3|Acadia          |       5|
0|  17703      4|advertising     |       4|          2|   5394      5|dealt with      |       3|          0|   2883
36|dauphne        |       9|          0|  66850     -3|Saved Drafts    |       0|          0|      0
 
(10 rows)

It looks like the order by is only being applied to the original select,
not the unioned select. Some authority should check on it, but by thought
it that a union does not necessarily maintain the order, so the entire
select should be applied to the order.

I'm not so good at interpreting the query plan, but here it is:
Unique  (cost=8.10 rows=0 width=0) ->  Sort  (cost=8.10 rows=0 width=0)   ->  Append  (cost=8.10 rows=0 width=0)     ->
Aggregate  (cost=6.05 rows=1 width=49)       ->  Group  (cost=6.05 rows=1 width=49)         ->  Sort  (cost=6.05 rows=1
width=49)          ->  Nested Loop  (cost=6.05 rows=1 width=49)             ->  Index Scan using usermail_pkey on
usermail (cost=2.05 rows=2 width=21)             ->  Index Scan using folders_pkey on folders  (cost=2.00 rows=8448
width=28)      -> Index Scan using folders_pkey on folders (cost=2.05 rows=2 width=16)            SubPlan
->Index Scan using usermail_pkey on usermail (cost=2.05 rows=1 width=4)
 

I would have expected the folderid -3 to appear as the 3rd one in this
case.

I'm probably going to change the numbering scheme of the system folders so
they will sort correctly without a kluge such as:
create function ordfolderid(int) returns int as 'select $1*-1 where $1<0
union select $1+1*10 where $1>=0' language 'sql';

Then running the order clause as: 
order by (folderid<0),ordfolderid(folderid)
My thought behind this kludge is that the table should first be ordered by
the t/f value of the fact folderid<0, then within each of the true and
false sortings, subsort those by the value of folderid.

Complicated enough for you?

Well, in my playing I notice what appears to be more of a bug...
SELECT folderid,foldername,count(*) as "messages",sum(bool2int(flagnew))
as "newmessages",sum(contentlength) as "size" FROM usermail,folders WHERE
usermail.loginid='michael' and folders.loginid=usermail.loginid AND
usermail.folder = folders.folderid GROUP BY folderid,foldername UNION
SELECT folderid,foldername,0,0,0 FROM folders WHERE loginid='michael' AND
NOT EXISTS (SELECT folder FROM usermail WHERE loginid='michael' AND
folder=folderid) order by (folderid<0);
folderid|foldername      |messgaes|newmessages|   size
--------+----------------+--------+-----------+-------      1|OOL             |       7|          0|   8470
2|suggestions    |      26|          0|  35433      3|Acadia          |       5|          0|  17703      4|advertising
  |       4|          2|   5394      5|dealt with      |       3|          0|   2883     36|dauphne         |       9|
       0|  66850     -4|Deleted Messages|     110|         50| 245627     -2|Sent Mail       |       7|          2|
10878    -1|New Mail Folder |      73|          1|8831226     -3|Saved Drafts    |       0|          0|      0
 
(10 rows)

SELECT folderid,foldername,count(*) as "messages",sum(bool2int(flagnew))
as "newmessages",sum(contentlength) as "size" FROM usermail,folders WHERE
usermail.loginid='michael' and folders.loginid=usermail.loginid AND
usermail.folder = folders.folderid GROUP BY folderid,foldername UNION
SELECT folderid,foldername,0,0,0 FROM folders WHERE loginid='michael' AND
NOT EXISTS (SELECT folder FROM usermail WHERE loginid='michael' AND
folder=folderid) order by (messages<10);
ERROR:  attribute 'messages' not found

Using a column name within an expression in the order by does not seem to
work...
Or a much simpler example to illustrate the bug:
fastmail=> select 1 as "test" order by (test<9);
ERROR:  attribute 'test' not found

fastmail=> select 1 as "test" order by test;
test
----  1
(1 row)


I was almost able to make it work properly aside from the sorting issue
with my kludged up routine... This is so nasty that I most definitely
don't want to put it into production:

SELECT folderid,foldername,count(*) as "messages",sum(bool2int(flagnew))
as "newmessages",sum(contentlength) as "size",(folderid>=0) FROM
usermail,folders WHERE usermail.loginid='michael' and
folders.loginid=usermail.loginid AND usermail.folder = folders.folderid
GROUP BY folderid,foldername UNION SELECT
folderid,foldername,0,0,0,(folderid>=0) FROM folders WHERE
loginid='michael' AND NOT EXISTS (SELECT folder FROM usermail WHERE
loginid='michael' AND folder=folderid) order by 6,ordfolderid(folderid);
folderid|foldername      |messages|newmessages|   size|?column?
--------+----------------+--------+-----------+-------+--------     -1|New Mail Folder |      73|          1|8831226|f
         -2|Sent Mail       |       7|          2|  10878|f            -4|Deleted Messages|     110|         50|
245627|f           -3|Saved Drafts    |       0|          0|      0|f             1|OOL             |       7|
0|  8470|t             2|suggestions     |      26|          0|  35433|t             3|Acadia          |       5|
  0|  17703|t             4|advertising     |       4|          2|   5394|t             5|dealt with      |       3|
     0|   2883|t            36|dauphne         |       9|          0|  66850|t       
 
(10 rows)

Do I need outer joins to make this work instead of the screwed up union
method I'm trying here, or is it just a series of bugs?

-Michael



Re: [HACKERS] Counting bool flags in a complex query

From
Tom Lane
Date:
Michael Richards <miker@scifair.acadiau.ca> writes:
> I've found what I believe is another set of bugs:

I can shed some light on these.

> This may not be valid SQL, as none of my books mention it. Is it possible
> to order by an expression?

Postgres accepts expressions as ORDER BY clauses, although strict SQL92
only allows sorting by a column name or number.

> It looks like the order by is only being applied to the original select,
> not the unioned select. Some authority should check on it, but by thought
> it that a union does not necessarily maintain the order, so the entire
> select should be applied to the order.

That looks like a bug to me too --- I think the ORDER BY is supposed to
apply across the whole UNION result.  Will look into it.

> I'm probably going to change the numbering scheme of the system folders so
> they will sort correctly without a kluge such as:

Good plan.  Although you could sort by a user-defined function result,
it's likely to be horribly slow (because user-defined functions are
slow:-().

> Using a column name within an expression in the order by does not seem to
> work...
> Or a much simpler example to illustrate the bug:
> fastmail=> select 1 as "test" order by (test<9);
> ERROR:  attribute 'test' not found

This is not so much a bug as a definitional issue.  For SQL92
compatibility, we accept ORDER BY a column label so long as it's
a bare column label, but column labels are NOT part of the namespace
for full expression evaluation.  You can't do this either:

select 1 as "test" , test<9 ;
ERROR:  attribute 'test' not found

There are all sorts of squirrely questions about this feature IMHO.
For example,

create table z1 (f1 int4, f2 int4);
CREATE
select f1 as f2, f2 from z1 order by f2;
f2|f2
--+--
(0 rows)

Which column do you think it's ordering by?  Which column *should* it
order by?  I think this ought to draw an "ambiguous column label" error
... there is code in there that claims to be looking for such a thing,
in fact, so I am not quite sure why it doesn't trigger on this example.
        regards, tom lane


Re: [HACKERS] Counting bool flags in a complex query

From
Michael Richards
Date:
On Fri, 16 Jul 1999, Tom Lane wrote:

> Good plan.  Although you could sort by a user-defined function result,
> it's likely to be horribly slow (because user-defined functions are
> slow:-().
Yes, but I did include my horrible design ideas so you could see why in
"god's name" I was trying to do what I was trying to do when I found what
looked to be a "bug"

> This is not so much a bug as a definitional issue.  For SQL92
> compatibility, we accept ORDER BY a column label so long as it's
> a bare column label, but column labels are NOT part of the namespace
> for full expression evaluation.  You can't do this either:
> 
> select 1 as "test" , test<9 ;
> ERROR:  attribute 'test' not found
> 
> There are all sorts of squirrely questions about this feature IMHO.
> For example,
> 
> create table z1 (f1 int4, f2 int4);
> CREATE
> select f1 as f2, f2 from z1 order by f2;
> f2|f2
> --+--
> (0 rows)
> 
> Which column do you think it's ordering by?  Which column *should* it
> order by?  I think this ought to draw an "ambiguous column label" error
> ... there is code in there that claims to be looking for such a thing,
> in fact, so I am not quite sure why it doesn't trigger on this example.

Good point. Is there anything in the SQL standard that defined how this
"is supposed" to work? I suppose with no expression support it isn't
really necessary. How about requiring quotes when we're to look at it was
"named" columns? If 
select f1 as f2, f2 from z1 order by "f2";

Of course I have no idea how this would conflicy with SQL-92. It's more of
an idea...

-Michael



Re: [SQL] Re: [HACKERS] Counting bool flags in a complex query

From
Tom Lane
Date:
Michael Richards <miker@scifair.acadiau.ca> writes:
>> For example,
>> 
>> create table z1 (f1 int4, f2 int4);
>> CREATE
>> select f1 as f2, f2 from z1 order by f2;
>> f2|f2
>> --+--
>> (0 rows)
>> 
>> Which column do you think it's ordering by?  Which column *should* it
>> order by?  I think this ought to draw an "ambiguous column label" error

> Good point. Is there anything in the SQL standard that defined how this
> "is supposed" to work?

After looking at the SQL spec I think the above definitely ought to draw
an error.  We have the following verbiage concerning the column names
for the result of a SELECT:
           a) If the i-th <derived column> in the <select list> specifies             an <as clause> that contains a
<columnname> C, then the             <column name> of the i-th column of the result is C.
 
           b) If the i-th <derived column> in the <select list> does not             specify an <as clause> and the
<valueexpression> of that             <derived column> is a single <column reference>, then the             <column
name>of the i-th column of the result is C.
 
           c) Otherwise, the <column name> of the i-th column of the <query             specification> is
implementation-dependentand different             from the <column name> of any column, other than itself, of
 a table referenced by any <table reference> contained in the             SQL-statement.
 

which Postgres does indeed follow, and we see from (a) and (b) that "f2"
is the required column name for both columns of the SELECT result.
Now ORDER BY says
           a) If a <sort specification> contains a <column name>, then T             shall contain exactly one column
withthat <column name> and             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^             the <sort specification> identifies
thatcolumn.
 

which sure looks to me like it mandates an error for the example
statement.

However, since SQL doesn't consider the possibility of expressions as
ORDER BY entries, we are more or less on our own for those.  An
expression appearing in the target list of a SELECT is not allowed to
refer to columns by their "AS" names (and this does seem to be mandated
by SQL92).  So I think it makes sense to carry over the same restriction
to ORDER BY.
        regards, tom lane


user defined function speeds

From
"tjk@tksoft.com"
Date:
> > Good plan.  Although you could sort by a user-defined function result,
> > it's likely to be horribly slow (because user-defined functions are
> > slow:-().


What is the actual overhead from using a userdefined
function (a C function, say)?

Obviously there is an overhead involved in calling
any function. I would like to know what kind of issues
are involved when using results of a user defined function for
sorting.

Are the results calculated only once, as one would expect,
for example.

Thanks,


Troy

Troy Korjuslommi                Tksoft OY, Inc.
tjk@tksoft.com                  Software Development
                                Open Source Solutions
                                Hosting Services

Re: [SQL] Re: [HACKERS] Counting bool flags in a complex query

From
Herouth Maoz
Date:
At 11:37 +0300 on 16/07/1999, Michael Richards wrote:


> My folder numbers are: negative numbers are system folders such as New
> mail, trash, drafts and sentmail. I wanted to order the tuples so that the
> folderids were sorted from -1 to -4, then 1 to x. This way the system
> folders would always appear first in the list.
>
> This may not be valid SQL, as none of my books mention it. Is it possible
> to order by an expression?
>
> Here are some examples which some some odd behaviour. My suspected bug
> findings are at the end:

I think the problem results from using non-standard constructs such as
order by expression, and indeed ordering by columns that don't appear in
the select list.

If you want to do the best by yourself, put the expression by which you
order in the select list. A simple example would be:

Instead of: SELECT f1, min( f2 ), max ( f3 ) GROUP BY f1 ORDER BY expr( f1 );

Use:
 SELECT expr( f1 ) AS ordcol, f1, min( f2 ), max( f3 ) GROUP BY ordcol, f1 ORDER BY ordcol;

What is the difference? The difference is that now GROUP BY (which also
does internal sorting) knows about that expression and considers it. Since
ordcol is the same for each value of f1, this should not change the groups.
This simply makes sure all parts of the query are aware of what is being
done around them. This is also the standard, as far as I recall.

What's the problem? You have a column in the output that you didn't really
want. But hey, why should that bother you? If you're reading it through
some frontend, simply have it ignore the first column that returns.

Herouth

--
Herouth Maoz, Internet developer.
Open University of Israel - Telem project
http://telem.openu.ac.il/~herutma




Re: [SQL] Re: [HACKERS] Counting bool flags in a complex query

From
Tom Lane
Date:
Herouth Maoz <herouth@oumail.openu.ac.il> writes:
> I think the problem results from using non-standard constructs such as
> order by expression, and indeed ordering by columns that don't appear in
> the select list.

No, that's not the problem.  I looked into this last night, but hadn't
got round to reporting my findings to the mail list.  Richards has
indeed tripped over a plain garden-variety bug, I think.  ORDER BY
clauses are set up to represent their targets by hard-coded pointers
to Resdom nodes in the target list, which is an incredibly fragile
representation, and I suspect it is failing to survive the slicing
dicing and mangling of the parsetree that happens during UNION
processing.  (The final output plan has Resdoms that are copies of the
originals, and the ORDER BY nodes are not pointing at the same copies as
are in the tlist, so any change to the tlist Resdoms blows the sort...)

We need to change this to be more like GROUP BY is currently done,
with an ID number that's actually stored in the Resdoms so that the
tlist can be copied without breaking it.  I'm afraid that means no
fix for 6.5.*, but I will try to fix it soon for 6.6.

> [ example of not using hidden target list entries snipped ]
> What is the difference? The difference is that now GROUP BY (which also
> does internal sorting) knows about that expression and considers it.

Richards' example would still fail if he did that.  He might have better
luck using UNION ALL instead of UNION --- I think that the specific
problem is that the DISTINCT implied by UNION requires a sort, which is
not getting merged with the user-given toplevel ORDER BY correctly as a
result of Resdom breakage.

We have had problems in the past with various bits of code failing to
ignore the hidden tlist entries ("resjunk" targets) that are added
to support ORDER BY or GROUP BY values that are not in the given tlist.
I suppose Herouth is suspicious of the feature because he's been burnt
by some of those bugs.  They are getting flushed out, though,
gradually...
        regards, tom lane


Re: [HACKERS] Counting bool flags in a complex query

From
Tom Lane
Date:
Quite awhile ago, Michael Richards <miker@scifair.acadiau.ca> wrote:
> It looks like the order by is only being applied to the original select,
> not the unioned select. Some authority should check on it, but by thought
> it that a union does not necessarily maintain the order, so the entire
> select should be applied to the order.

Just FYI, I have committed code for 7.1 that allows ORDER BY to work
correctly for a UNION'd query.  A limitation is that you can only do
ordering on columns that are outputs of the UNION:

regression=# select q1 from int8_tbl union select q2 from int8_tbl order by 1;       q1
--------------------4567890123456789              123              456 4567890123456789
(4 rows)

regression=# select q1 from int8_tbl union select q2 from int8_tbl order by int8_tbl.q1+1;
ERROR:  ORDER BY on a UNION/INTERSECT/EXCEPT result must be on one of the result columns

In the general case of an arbitrary ORDER BY expression, it's not clear
how to transpose it into each UNION source select anyway.  It could
be made to work for expressions using only the output columns, but since
ORDER BY expressions are not standard SQL I'm not in a big hurry to make
that happen...
        regards, tom lane


Re: Re: [HACKERS] Counting bool flags in a complex query

From
"Josh Berkus"
Date:
Tom,

> Just FYI, I have committed code for 7.1 that allows ORDER
> BY to work
> correctly for a UNION'd query.  A limitation is that you
> can only do
> ordering on columns that are outputs of the UNION:

As far as I know, that limitation is standard to all SQL
that supports UNION; the relational calculus (I'm told) is
impossible otherwise.  

So ... we keep hearing about all the fantastic fixes in 7.1.
When will a stable build show up? :-)

-Josh


Re: Re: [HACKERS] Counting bool flags in a complex query

From
Tom Lane
Date:
"Josh Berkus" <josh@agliodbs.com> writes:
>> Just FYI, I have committed code for 7.1 that allows ORDER BY to work
>> correctly for a UNION'd query.  A limitation is that you can only do
>> ordering on columns that are outputs of the UNION:

> As far as I know, that limitation is standard to all SQL
> that supports UNION; the relational calculus (I'm told) is
> impossible otherwise.  

It's not very reasonable to imagine ordering on arbitrary expressions;
how would you interpret the expression in each sub-SELECT?  But it's
reasonable to imagine ordering on expressions that use only the
output columns of the UNION-type query:
SELECT q1, q2 FROM tbl1 UNION SELECT ...    ORDER BY q1+q2;

However, I didn't try to implement this yet.

> So ... we keep hearing about all the fantastic fixes in 7.1.
> When will a stable build show up? :-)

How stable is stable?  I'd say it's plenty stable enough for beta
testing now, even though we're not putting out formal beta releases
quite yet.  You could grab a nightly snapshot off the FTP server
if you want to try it.  (Beware that you will most likely have to
do another initdb before beta, so loading lots and lots of data
into a snapshot installation is probably a waste of time.)
        regards, tom lane