Thread: Re: [HACKERS] Counting bool flags in a complex query
On Thu, 15 Jul 1999, Tom Lane wrote: > Michael Richards <miker@scifair.acadiau.ca> writes: > > I'm not sure this is correct, but I think I see a bug of some sort... > > I committed a fix last night; it will be in 6.5.1. I've found what I believe is another set of bugs: This is my monster query again... My folder numbers are: negative numbers are system folders such as New mail, trash, drafts and sentmail. I wanted to order the tuples so that the folderids were sorted from -1 to -4, then 1 to x. This way the system folders would always appear first in the list. This may not be valid SQL, as none of my books mention it. Is it possible to order by an expression? Here are some examples which some some odd behaviour. My suspected bug findings are at the end: SELECT folderid,foldername,count(*) as "messgaes",sum(bool2int(flagnew)) as "newmessages",sum(contentlength) as "size" FROM usermail,folders WHERE usermail.loginid='michael' and folders.loginid=usermail.loginid AND usermail.folder = folders.folderid GROUP BY folderid,foldername UNION SELECT folderid,foldername,0,0,0 FROM folders WHERE loginid='michael' AND NOT EXISTS (SELECT folder FROM usermail WHERE loginid='michael' AND folder=folderid) order by (folderid>0); folderid|foldername |messgaes|newmessages| size --------+----------------+--------+-----------+------- -4|Deleted Messages| 110| 50| 245627 -2|Sent Mail | 7| 2| 10878 -1|New Mail Folder | 73| 1|8831226 1|OOL | 7| 0| 8470 2|suggestions | 26| 0| 35433 3|Acadia | 5| 0| 17703 4|advertising | 4| 2| 5394 5|dealt with | 3| 0| 2883 36|dauphne | 9| 0| 66850 -3|Saved Drafts | 0| 0| 0 (10 rows) It looks like the order by is only being applied to the original select, not the unioned select. Some authority should check on it, but by thought it that a union does not necessarily maintain the order, so the entire select should be applied to the order. I'm not so good at interpreting the query plan, but here it is: Unique (cost=8.10 rows=0 width=0) -> Sort (cost=8.10 rows=0 width=0) -> Append (cost=8.10 rows=0 width=0) -> Aggregate (cost=6.05 rows=1 width=49) -> Group (cost=6.05 rows=1 width=49) -> Sort (cost=6.05 rows=1 width=49) -> Nested Loop (cost=6.05 rows=1 width=49) -> Index Scan using usermail_pkey on usermail (cost=2.05 rows=2 width=21) -> Index Scan using folders_pkey on folders (cost=2.00 rows=8448 width=28) -> Index Scan using folders_pkey on folders (cost=2.05 rows=2 width=16) SubPlan ->Index Scan using usermail_pkey on usermail (cost=2.05 rows=1 width=4) I would have expected the folderid -3 to appear as the 3rd one in this case. I'm probably going to change the numbering scheme of the system folders so they will sort correctly without a kluge such as: create function ordfolderid(int) returns int as 'select $1*-1 where $1<0 union select $1+1*10 where $1>=0' language 'sql'; Then running the order clause as: order by (folderid<0),ordfolderid(folderid) My thought behind this kludge is that the table should first be ordered by the t/f value of the fact folderid<0, then within each of the true and false sortings, subsort those by the value of folderid. Complicated enough for you? Well, in my playing I notice what appears to be more of a bug... SELECT folderid,foldername,count(*) as "messages",sum(bool2int(flagnew)) as "newmessages",sum(contentlength) as "size" FROM usermail,folders WHERE usermail.loginid='michael' and folders.loginid=usermail.loginid AND usermail.folder = folders.folderid GROUP BY folderid,foldername UNION SELECT folderid,foldername,0,0,0 FROM folders WHERE loginid='michael' AND NOT EXISTS (SELECT folder FROM usermail WHERE loginid='michael' AND folder=folderid) order by (folderid<0); folderid|foldername |messgaes|newmessages| size --------+----------------+--------+-----------+------- 1|OOL | 7| 0| 8470 2|suggestions | 26| 0| 35433 3|Acadia | 5| 0| 17703 4|advertising | 4| 2| 5394 5|dealt with | 3| 0| 2883 36|dauphne | 9| 0| 66850 -4|Deleted Messages| 110| 50| 245627 -2|Sent Mail | 7| 2| 10878 -1|New Mail Folder | 73| 1|8831226 -3|Saved Drafts | 0| 0| 0 (10 rows) SELECT folderid,foldername,count(*) as "messages",sum(bool2int(flagnew)) as "newmessages",sum(contentlength) as "size" FROM usermail,folders WHERE usermail.loginid='michael' and folders.loginid=usermail.loginid AND usermail.folder = folders.folderid GROUP BY folderid,foldername UNION SELECT folderid,foldername,0,0,0 FROM folders WHERE loginid='michael' AND NOT EXISTS (SELECT folder FROM usermail WHERE loginid='michael' AND folder=folderid) order by (messages<10); ERROR: attribute 'messages' not found Using a column name within an expression in the order by does not seem to work... Or a much simpler example to illustrate the bug: fastmail=> select 1 as "test" order by (test<9); ERROR: attribute 'test' not found fastmail=> select 1 as "test" order by test; test ---- 1 (1 row) I was almost able to make it work properly aside from the sorting issue with my kludged up routine... This is so nasty that I most definitely don't want to put it into production: SELECT folderid,foldername,count(*) as "messages",sum(bool2int(flagnew)) as "newmessages",sum(contentlength) as "size",(folderid>=0) FROM usermail,folders WHERE usermail.loginid='michael' and folders.loginid=usermail.loginid AND usermail.folder = folders.folderid GROUP BY folderid,foldername UNION SELECT folderid,foldername,0,0,0,(folderid>=0) FROM folders WHERE loginid='michael' AND NOT EXISTS (SELECT folder FROM usermail WHERE loginid='michael' AND folder=folderid) order by 6,ordfolderid(folderid); folderid|foldername |messages|newmessages| size|?column? --------+----------------+--------+-----------+-------+-------- -1|New Mail Folder | 73| 1|8831226|f -2|Sent Mail | 7| 2| 10878|f -4|Deleted Messages| 110| 50| 245627|f -3|Saved Drafts | 0| 0| 0|f 1|OOL | 7| 0| 8470|t 2|suggestions | 26| 0| 35433|t 3|Acadia | 5| 0| 17703|t 4|advertising | 4| 2| 5394|t 5|dealt with | 3| 0| 2883|t 36|dauphne | 9| 0| 66850|t (10 rows) Do I need outer joins to make this work instead of the screwed up union method I'm trying here, or is it just a series of bugs? -Michael
Michael Richards <miker@scifair.acadiau.ca> writes: > I've found what I believe is another set of bugs: I can shed some light on these. > This may not be valid SQL, as none of my books mention it. Is it possible > to order by an expression? Postgres accepts expressions as ORDER BY clauses, although strict SQL92 only allows sorting by a column name or number. > It looks like the order by is only being applied to the original select, > not the unioned select. Some authority should check on it, but by thought > it that a union does not necessarily maintain the order, so the entire > select should be applied to the order. That looks like a bug to me too --- I think the ORDER BY is supposed to apply across the whole UNION result. Will look into it. > I'm probably going to change the numbering scheme of the system folders so > they will sort correctly without a kluge such as: Good plan. Although you could sort by a user-defined function result, it's likely to be horribly slow (because user-defined functions are slow:-(). > Using a column name within an expression in the order by does not seem to > work... > Or a much simpler example to illustrate the bug: > fastmail=> select 1 as "test" order by (test<9); > ERROR: attribute 'test' not found This is not so much a bug as a definitional issue. For SQL92 compatibility, we accept ORDER BY a column label so long as it's a bare column label, but column labels are NOT part of the namespace for full expression evaluation. You can't do this either: select 1 as "test" , test<9 ; ERROR: attribute 'test' not found There are all sorts of squirrely questions about this feature IMHO. For example, create table z1 (f1 int4, f2 int4); CREATE select f1 as f2, f2 from z1 order by f2; f2|f2 --+-- (0 rows) Which column do you think it's ordering by? Which column *should* it order by? I think this ought to draw an "ambiguous column label" error ... there is code in there that claims to be looking for such a thing, in fact, so I am not quite sure why it doesn't trigger on this example. regards, tom lane
On Fri, 16 Jul 1999, Tom Lane wrote: > Good plan. Although you could sort by a user-defined function result, > it's likely to be horribly slow (because user-defined functions are > slow:-(). Yes, but I did include my horrible design ideas so you could see why in "god's name" I was trying to do what I was trying to do when I found what looked to be a "bug" > This is not so much a bug as a definitional issue. For SQL92 > compatibility, we accept ORDER BY a column label so long as it's > a bare column label, but column labels are NOT part of the namespace > for full expression evaluation. You can't do this either: > > select 1 as "test" , test<9 ; > ERROR: attribute 'test' not found > > There are all sorts of squirrely questions about this feature IMHO. > For example, > > create table z1 (f1 int4, f2 int4); > CREATE > select f1 as f2, f2 from z1 order by f2; > f2|f2 > --+-- > (0 rows) > > Which column do you think it's ordering by? Which column *should* it > order by? I think this ought to draw an "ambiguous column label" error > ... there is code in there that claims to be looking for such a thing, > in fact, so I am not quite sure why it doesn't trigger on this example. Good point. Is there anything in the SQL standard that defined how this "is supposed" to work? I suppose with no expression support it isn't really necessary. How about requiring quotes when we're to look at it was "named" columns? If select f1 as f2, f2 from z1 order by "f2"; Of course I have no idea how this would conflicy with SQL-92. It's more of an idea... -Michael
Michael Richards <miker@scifair.acadiau.ca> writes: >> For example, >> >> create table z1 (f1 int4, f2 int4); >> CREATE >> select f1 as f2, f2 from z1 order by f2; >> f2|f2 >> --+-- >> (0 rows) >> >> Which column do you think it's ordering by? Which column *should* it >> order by? I think this ought to draw an "ambiguous column label" error > Good point. Is there anything in the SQL standard that defined how this > "is supposed" to work? After looking at the SQL spec I think the above definitely ought to draw an error. We have the following verbiage concerning the column names for the result of a SELECT: a) If the i-th <derived column> in the <select list> specifies an <as clause> that contains a <columnname> C, then the <column name> of the i-th column of the result is C. b) If the i-th <derived column> in the <select list> does not specify an <as clause> and the <valueexpression> of that <derived column> is a single <column reference>, then the <column name>of the i-th column of the result is C. c) Otherwise, the <column name> of the i-th column of the <query specification> is implementation-dependentand different from the <column name> of any column, other than itself, of a table referenced by any <table reference> contained in the SQL-statement. which Postgres does indeed follow, and we see from (a) and (b) that "f2" is the required column name for both columns of the SELECT result. Now ORDER BY says a) If a <sort specification> contains a <column name>, then T shall contain exactly one column withthat <column name> and ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the <sort specification> identifies thatcolumn. which sure looks to me like it mandates an error for the example statement. However, since SQL doesn't consider the possibility of expressions as ORDER BY entries, we are more or less on our own for those. An expression appearing in the target list of a SELECT is not allowed to refer to columns by their "AS" names (and this does seem to be mandated by SQL92). So I think it makes sense to carry over the same restriction to ORDER BY. regards, tom lane
> > Good plan. Although you could sort by a user-defined function result, > > it's likely to be horribly slow (because user-defined functions are > > slow:-(). What is the actual overhead from using a userdefined function (a C function, say)? Obviously there is an overhead involved in calling any function. I would like to know what kind of issues are involved when using results of a user defined function for sorting. Are the results calculated only once, as one would expect, for example. Thanks, Troy Troy Korjuslommi Tksoft OY, Inc. tjk@tksoft.com Software Development Open Source Solutions Hosting Services
At 11:37 +0300 on 16/07/1999, Michael Richards wrote: > My folder numbers are: negative numbers are system folders such as New > mail, trash, drafts and sentmail. I wanted to order the tuples so that the > folderids were sorted from -1 to -4, then 1 to x. This way the system > folders would always appear first in the list. > > This may not be valid SQL, as none of my books mention it. Is it possible > to order by an expression? > > Here are some examples which some some odd behaviour. My suspected bug > findings are at the end: I think the problem results from using non-standard constructs such as order by expression, and indeed ordering by columns that don't appear in the select list. If you want to do the best by yourself, put the expression by which you order in the select list. A simple example would be: Instead of: SELECT f1, min( f2 ), max ( f3 ) GROUP BY f1 ORDER BY expr( f1 ); Use: SELECT expr( f1 ) AS ordcol, f1, min( f2 ), max( f3 ) GROUP BY ordcol, f1 ORDER BY ordcol; What is the difference? The difference is that now GROUP BY (which also does internal sorting) knows about that expression and considers it. Since ordcol is the same for each value of f1, this should not change the groups. This simply makes sure all parts of the query are aware of what is being done around them. This is also the standard, as far as I recall. What's the problem? You have a column in the output that you didn't really want. But hey, why should that bother you? If you're reading it through some frontend, simply have it ignore the first column that returns. Herouth -- Herouth Maoz, Internet developer. Open University of Israel - Telem project http://telem.openu.ac.il/~herutma
Herouth Maoz <herouth@oumail.openu.ac.il> writes: > I think the problem results from using non-standard constructs such as > order by expression, and indeed ordering by columns that don't appear in > the select list. No, that's not the problem. I looked into this last night, but hadn't got round to reporting my findings to the mail list. Richards has indeed tripped over a plain garden-variety bug, I think. ORDER BY clauses are set up to represent their targets by hard-coded pointers to Resdom nodes in the target list, which is an incredibly fragile representation, and I suspect it is failing to survive the slicing dicing and mangling of the parsetree that happens during UNION processing. (The final output plan has Resdoms that are copies of the originals, and the ORDER BY nodes are not pointing at the same copies as are in the tlist, so any change to the tlist Resdoms blows the sort...) We need to change this to be more like GROUP BY is currently done, with an ID number that's actually stored in the Resdoms so that the tlist can be copied without breaking it. I'm afraid that means no fix for 6.5.*, but I will try to fix it soon for 6.6. > [ example of not using hidden target list entries snipped ] > What is the difference? The difference is that now GROUP BY (which also > does internal sorting) knows about that expression and considers it. Richards' example would still fail if he did that. He might have better luck using UNION ALL instead of UNION --- I think that the specific problem is that the DISTINCT implied by UNION requires a sort, which is not getting merged with the user-given toplevel ORDER BY correctly as a result of Resdom breakage. We have had problems in the past with various bits of code failing to ignore the hidden tlist entries ("resjunk" targets) that are added to support ORDER BY or GROUP BY values that are not in the given tlist. I suppose Herouth is suspicious of the feature because he's been burnt by some of those bugs. They are getting flushed out, though, gradually... regards, tom lane
Quite awhile ago, Michael Richards <miker@scifair.acadiau.ca> wrote: > It looks like the order by is only being applied to the original select, > not the unioned select. Some authority should check on it, but by thought > it that a union does not necessarily maintain the order, so the entire > select should be applied to the order. Just FYI, I have committed code for 7.1 that allows ORDER BY to work correctly for a UNION'd query. A limitation is that you can only do ordering on columns that are outputs of the UNION: regression=# select q1 from int8_tbl union select q2 from int8_tbl order by 1; q1 --------------------4567890123456789 123 456 4567890123456789 (4 rows) regression=# select q1 from int8_tbl union select q2 from int8_tbl order by int8_tbl.q1+1; ERROR: ORDER BY on a UNION/INTERSECT/EXCEPT result must be on one of the result columns In the general case of an arbitrary ORDER BY expression, it's not clear how to transpose it into each UNION source select anyway. It could be made to work for expressions using only the output columns, but since ORDER BY expressions are not standard SQL I'm not in a big hurry to make that happen... regards, tom lane
Tom, > Just FYI, I have committed code for 7.1 that allows ORDER > BY to work > correctly for a UNION'd query. A limitation is that you > can only do > ordering on columns that are outputs of the UNION: As far as I know, that limitation is standard to all SQL that supports UNION; the relational calculus (I'm told) is impossible otherwise. So ... we keep hearing about all the fantastic fixes in 7.1. When will a stable build show up? :-) -Josh
"Josh Berkus" <josh@agliodbs.com> writes: >> Just FYI, I have committed code for 7.1 that allows ORDER BY to work >> correctly for a UNION'd query. A limitation is that you can only do >> ordering on columns that are outputs of the UNION: > As far as I know, that limitation is standard to all SQL > that supports UNION; the relational calculus (I'm told) is > impossible otherwise. It's not very reasonable to imagine ordering on arbitrary expressions; how would you interpret the expression in each sub-SELECT? But it's reasonable to imagine ordering on expressions that use only the output columns of the UNION-type query: SELECT q1, q2 FROM tbl1 UNION SELECT ... ORDER BY q1+q2; However, I didn't try to implement this yet. > So ... we keep hearing about all the fantastic fixes in 7.1. > When will a stable build show up? :-) How stable is stable? I'd say it's plenty stable enough for beta testing now, even though we're not putting out formal beta releases quite yet. You could grab a nightly snapshot off the FTP server if you want to try it. (Beware that you will most likely have to do another initdb before beta, so loading lots and lots of data into a snapshot installation is probably a waste of time.) regards, tom lane