Thread: Re: Distinct + Limit
Hi list,

I have the following table with millions of rows:

CREATE TABLE table1
(
  col1 text,
  col2 text,
  col3 text,
  col4 text,
  col5 text,
  col6 text
)

select col1 from table1 group by col1 limit 1;
select distinct on (col1) col1 from table1 limit 1;

select col1 from table1 group by col1 limit 2;
select distinct on (col1) col1 from table1 limit 2;

Performing any of these queries results in a full sequential scan, followed by a hash aggregate, and then the limit. An optimization could be to stop the sequential scan as soon as the limit of results has been reached. Am I missing something?

Limit  (cost=2229280.06..2229280.08 rows=2 width=8)
  ->  HashAggregate  (cost=2229280.06..2229280.21 rows=15 width=8)
        ->  Seq Scan on table1  (cost=0.00..2190241.25 rows=15615525 width=8)

Similarly, the following query results in a sequential scan:

select * from table1 where col1 <> col1;

This query is generated by the Sequel library abstraction layer in Ruby when filtering records based on an empty array of values. We fixed this by handling that case on the client side, but originally thought the server would have rewritten it and sent back an empty result set.

I would greatly appreciate any help on speeding up these queries without having to rewrite them on the client side.

Thanks,

Francois
On Tue, Mar 27, 2012 at 11:54 PM, Francois Deliege <fdeliege@gmail.com> wrote:
> select col1 from table1 group by col1 limit 1;
> select distinct on (col1) col1 from table1 limit 1;
>
> select col1 from table1 group by col1 limit 2;
> select distinct on (col1) col1 from table1 limit 2;
>
> Performing any of these queries results in a full sequential scan, followed by a hash aggregate, and then the limit. An optimization could be to stop the sequential scan as soon as the limit of results has been reached. Am I missing something?

Yes, that would be an optimization. Unfortunately, the aggregation logic currently has no special-case path to start emitting tuples immediately when no aggregate functions are in use. In principle it's possible to teach it to do that; peeking at the code, it seems it wouldn't even be too hard to implement.

Currently your best options are to add indexes on the columns you select distinct values from, to use a server-side cursor and do the distinct operation on the client (you might need to adjust cursor_tuple_fraction before running the query to improve the cost estimates), or to use a stored procedure that performs the cursor + manual distinct trick.

> Similarly, the following query results in a sequential scan:
>
> select * from table1 where col1 <> col1;

The PostgreSQL query optimizer doesn't try to be a theorem prover, so it doesn't deduce the logical impossibility. For most queries, looking for nonsensical predicates would be a complete waste of time. The optimizer does notice impossibilities that crop up during constant propagation, so WHERE false or WHERE 0 = 1 would work fine. It would be best to fix Sequel to output a literal constant false for PostgreSQL.

However, I do wonder if it's worth checking for this very specific case, because it is a common idiom among Oracle users for implementing constant false in WHERE predicates, Oracle not allowing top-level literal booleans for some arcane reason or another.
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
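[Editor's note: the "server-side cursor + client-side distinct" trick Ants describes can be sketched as follows. This is a hypothetical illustration using Python's sqlite3 as a stand-in for PostgreSQL; with psycopg2 you would open a named (server-side) cursor so rows stream to the client instead of being fully materialized, and `fetchmany()` plays that role here. The table contents and `distinct_limit` helper are invented for the example.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1 TEXT)")
conn.executemany("INSERT INTO table1 VALUES (?)",
                 [("a",), ("a",), ("b",), ("a",), ("c",), ("b",)])

def distinct_limit(conn, limit, batch=2):
    """Stream rows and stop as soon as `limit` distinct values are seen."""
    cur = conn.execute("SELECT col1 FROM table1")  # plain scan, no ORDER BY
    seen = []
    while len(seen) < limit:
        rows = cur.fetchmany(batch)   # with psycopg2: batched fetch from a named cursor
        if not rows:
            break                     # table exhausted before hitting the limit
        for (val,) in rows:
            if val not in seen:
                seen.append(val)
                if len(seen) == limit:
                    break
    cur.close()                       # stop the scan early; remaining rows never travel
    return seen

print(distinct_limit(conn, 2))   # -> ['a', 'b'], without scanning the whole table
```

The point is that the scan is abandoned as soon as enough distinct values arrive, which is exactly the early-exit the planner currently won't do for GROUP BY + LIMIT.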
Francois Deliege <fdeliege@gmail.com> writes:
> I have the following table with millions of rows:
>
> CREATE TABLE table1
> (
>   col1 text,
>   col2 text,
>   col3 text,
>   col4 text,
>   col5 text,
>   col6 text
> )
>
> select col1 from table1 group by col1 limit 1;
> select distinct on (col1) col1 from table1 limit 1;
>
> select col1 from table1 group by col1 limit 2;
> select distinct on (col1) col1 from table1 limit 2;
>
> Performing any of these queries results in a full sequential scan, followed by a hash aggregate, and then the limit.

Well, if you had an index on the column, you would get a significantly better plan ...

> Similarly, the following query results in a sequential scan:
>
> select * from table1 where col1 <> col1;
>
> This query is generated by the Sequel library abstraction layer in Ruby when filtering records based on an empty array of values. We fixed this by handling that case on the client side, but originally thought the server would have rewritten it and sent back an empty result set.

It does not, and never will, because that would be an incorrect optimization. "col1 <> col1" isn't constant false; it's more like "col1 is not null". I'd suggest "WHERE FALSE", or "WHERE 1 <> 1" if you must, to generate a provably false constraint.

			regards, tom lane
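[Editor's note: Tom's point, that "col1 <> col1" is not constant false as an *expression*, can be checked directly. The sketch below uses sqlite3, whose three-valued NULL comparison semantics match PostgreSQL's for this case; the table contents are invented for the example. As a bare WHERE filter the query still returns nothing, because both NULL and FALSE exclude a row, but folding the expression to constant false would change the result under NOT.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1 TEXT)")
conn.executemany("INSERT INTO table1 VALUES (?)", [("a",), (None,), ("b",)])

# As a bare WHERE filter, NULL and FALSE both exclude the row,
# so the query returns nothing either way ...
empty = conn.execute("SELECT * FROM table1 WHERE col1 <> col1").fetchall()
print(empty)  # []

# ... but under NOT, the NULL row behaves differently from FALSE:
# NOT NULL is still NULL (row excluded), while NOT FALSE is TRUE.
negated = conn.execute(
    "SELECT col1 FROM table1 WHERE NOT (col1 <> col1)").fetchall()
print(negated)  # [('a',), ('b',)] -- the NULL row stays filtered out

rewritten = conn.execute(
    "SELECT col1 FROM table1 WHERE NOT (1 <> 1)").fetchall()
print(rewritten)  # all three rows, including the NULL one
```

That difference is why an optimizer cannot blindly replace "col1 <> col1" with constant false, while a client-generated "WHERE 1 <> 1" is genuinely constant and gets folded during constant propagation.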
On Wed, Mar 28, 2012 at 9:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Francois Deliege <fdeliege@gmail.com> writes:
>> Performing any of these queries results in a full sequential scan, followed by a hash aggregate, and then the limit.
>
> Well, if you had an index on the column, you would get a significantly
> better plan ...
>
>> Similarly, the following query results in a sequential scan:
>>
>> select * from table1 where col1 <> col1;
>
> It does not, and never will, because that would be an incorrect
> optimization. "col1 <> col1" isn't constant false; it's more like
> "col1 is not null". I'd suggest "WHERE FALSE", or "WHERE 1 <> 1"
> if you must, to generate a provably false constraint.

'col1 is distinct from col1' could be optimized like that, although it would be pretty hard to imagine a use case for it.

merlin
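[Editor's note: the reason Merlin's variant *could* be folded is that IS DISTINCT FROM is null-safe, so "col1 IS DISTINCT FROM col1" really is false for every row, NULLs included. A small sketch, again using sqlite3 as a stand-in: SQLite spells the same null-safe comparison "IS NOT", which substitutes here for PostgreSQL's IS DISTINCT FROM. The table contents are invented for the example.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1 TEXT)")
conn.executemany("INSERT INTO table1 VALUES (?)", [("a",), (None,)])

# "col1 <> col1" is NULL (unknown) for the NULL row, not false ...
neq = conn.execute("SELECT col1 <> col1 FROM table1").fetchall()
print(neq)   # [(0,), (None,)]

# ... while the null-safe form is false for every row, so an optimizer
# could legitimately fold it to constant false.
safe = conn.execute("SELECT col1 IS NOT col1 FROM table1").fetchall()
print(safe)  # [(0,), (0,)]
```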