Thread: Re: Distinct + Limit

Re: Distinct + Limit

From: Francois Deliege
Hi list,

I have the following table with millions of rows:

CREATE TABLE table1
(
  col1 text,
  col2 text,
  col3 text,
  col4 text,
  col5 text,
  col6 text
);

select col1 from table1 group by col1 limit 1;
select distinct on (col1) col1 from table1 limit 1;

select col1 from table1 group by col1 limit 2;
select distinct on (col1) col1 from table1 limit 2;

Performing any of these queries results in a full sequential scan, followed by a hash aggregate, and then the limit. An
optimization could be to stop the sequential scan as soon as the limit has been reached.  Am I missing something?

Limit  (cost=2229280.06..2229280.08 rows=2 width=8)
  ->  HashAggregate  (cost=2229280.06..2229280.21 rows=15 width=8)
        ->  Seq Scan on table1  (cost=0.00..2190241.25 rows=15615525 width=8)

Similarly, the following query results in a sequential scan:

select * from table1 where col1 <> col1;

This query is generated by the Sequel abstraction layer in Ruby when filtering records based on an empty array of
values. We fixed this by handling that case on the client side, but originally thought the server would have rewritten
it and sent an empty result set.

I would greatly appreciate any help on speeding up these queries without having to rewrite them on the client side.

Thanks,

Francois

Re: Distinct + Limit

From: Ants Aasma
On Tue, Mar 27, 2012 at 11:54 PM, Francois Deliege <fdeliege@gmail.com> wrote:
> select col1 from table1 group by col1 limit 1;
> select distinct on (col1) col1 from table1 limit 1;
>
> select col1 from table1 group by col1 limit 2;
> select distinct on (col1) col1 from table1 limit 2;
>
> Performing any of these queries results in a full sequential scan, followed
> by a hash aggregate, and then the limit.  An optimization could be to stop
> the sequential scan as soon as the limit has been reached.  Am I missing
> something?

Yes, that would be a valid optimization. Unfortunately, the aggregation
logic currently has no special case to start outputting tuples immediately
when no aggregate functions are in use. In principle it's possible to teach
it to do that; peeking at the code, it seems it wouldn't even be too hard
to implement.

Currently your best options are to add indexes on the columns you select
distinct values from, use a server-side cursor and do the distinct
operation on the client (you might need to adjust cursor_tuple_fraction
before running the query to get better cost estimates), or use a stored
procedure to do the cursor + manual distinct trick.
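A sketch of that last idea, the cursor + manual distinct trick (the index
and function names are made up; ordering by the indexed column keeps equal
values adjacent so duplicates can be skipped on the fly):

```sql
CREATE INDEX table1_col1_idx ON table1 (col1);

-- Walk table1 in col1 order and emit each value only when it changes,
-- stopping as soon as max_rows distinct values have been returned.
CREATE OR REPLACE FUNCTION distinct_col1(max_rows int)
RETURNS SETOF text
LANGUAGE plpgsql STABLE AS $$
DECLARE
    cur      CURSOR FOR SELECT col1 FROM table1 ORDER BY col1;
    last_val text := NULL;
    v        text;
    emitted  int  := 0;
BEGIN
    OPEN cur;
    WHILE emitted < max_rows LOOP
        FETCH cur INTO v;
        EXIT WHEN NOT FOUND;
        IF v IS DISTINCT FROM last_val THEN
            RETURN NEXT v;           -- new value: emit it
            last_val := v;
            emitted  := emitted + 1;
        END IF;
    END LOOP;
    CLOSE cur;
END
$$;

SELECT * FROM distinct_col1(2);
```

Unlike the hash aggregate plan, this stops reading as soon as the requested
number of distinct values has been produced.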

> Similarly, the following query results in a sequential scan:
>
> select * from table1 where col1 <> col1;

The PostgreSQL query optimizer doesn't try to be a theorem prover, so it
doesn't deduce the logical impossibility. For most queries, looking for
nonsensical predicates would be a complete waste of time. The optimizer
does notice impossibilities that crop up during constant propagation, so
WHERE false or WHERE 0 = 1 works fine. It would be best to fix Sequel to
output a literal constant false for PostgreSQL. However, I wonder whether
checking for this very specific case might be worthwhile, because it is a
common idiom among Oracle users for expressing constant false in WHERE
predicates, Oracle not allowing top-level literal booleans for some arcane
reason or another.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

Re: Distinct + Limit

From: Tom Lane
Francois Deliege <fdeliege@gmail.com> writes:
> I have the following table with millions of rows:

> CREATE TABLE table1
> (
>   col1 text,
>   col2 text,
>   col3 text,
>   col4 text,
>   col5 text,
>   col6 text
> )

> select col1 from table1 group by col1 limit 1;
> select distinct on (col1) col1 from table1 limit 1;

> select col1 from table1 group by col1 limit 2;
> select distinct on (col1) col1 from table1 limit 2;

> Performing any of these queries results in a full sequential scan,
> followed by a hash aggregate, and then the limit.

Well, if you had an index on the column, you would get a significantly
better plan ...
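For instance (a sketch, with a made-up index name), a btree index lets the
planner walk the values in order, and the Limit node can then cut the scan
short instead of aggregating the whole table:

```sql
-- Index on the column being grouped/distinct-ed
CREATE INDEX table1_col1_idx ON table1 (col1);

-- The planner can now satisfy the grouping with an ordered index scan
-- and stop after the first two groups, rather than hashing all rows.
SELECT col1 FROM table1 GROUP BY col1 LIMIT 2;
```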

> Similarly, the following query results in a sequential scan:

> select * from table1 where col1 <> col1;

> This query is generated by the Sequel abstraction layer in Ruby when
> filtering records based on an empty array of values. We fixed this by
> handling that case on the client side, but originally thought the server
> would have rewritten it and sent an empty result set.

It does not, and never will, because that would be an incorrect
optimization.  "col1 <> col1" isn't constant false, it's more like
"col1 is not null".  I'd suggest "WHERE FALSE", or "WHERE 1 <> 1"
if you must, to generate a provably false constraint.
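A quick illustration of the difference, following SQL's three-valued logic
(rewriting the expression to constant false would change its value for NULL
inputs):

```sql
SELECT NULL::text <> NULL::text;                -- NULL, not false
SELECT 'a'::text  <> 'a'::text;                 -- false
SELECT NULL::text IS DISTINCT FROM NULL::text;  -- false
```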

            regards, tom lane

Re: Distinct + Limit

From: Merlin Moncure
On Wed, Mar 28, 2012 at 9:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Francois Deliege <fdeliege@gmail.com> writes:
>> I have the following table with millions of rows:
>
>> CREATE TABLE table1
>> (
>>   col1 text,
>>   col2 text,
>>   col3 text,
>>   col4 text,
>>   col5 text,
>>   col6 text
>> )
>
>> select col1 from table1 group by col1 limit 1;
>> select distinct on (col1) col1 from table1 limit 1;
>
>> select col1 from table1 group by col1 limit 2;
>> select distinct on (col1) col1 from table1 limit 2;
>
>> Performing any of these queries results in a full sequential scan,
>> followed by a hash aggregate, and then the limit.
>
> Well, if you had an index on the column, you would get a significantly
> better plan ...
>
>> Similarly, the following query results in a sequential scan:
>
>> select * from table1 where col1 <> col1;
>
>> This query is generated by the Sequel abstraction layer in Ruby when
>> filtering records based on an empty array of values. We fixed this by
>> handling that case on the client side, but originally thought the server
>> would have rewritten it and sent an empty result set.
>
> It does not, and never will, because that would be an incorrect
> optimization.  "col1 <> col1" isn't constant false, it's more like
> "col1 is not null".  I'd suggest "WHERE FALSE", or "WHERE 1 <> 1"
> if you must, to generate a provably false constraint.

'col1 is distinct from col1' could be optimized like that, although it
would be pretty hard to imagine a use case for it.

merlin