Thread: SELECT DISTINCT too slow

SELECT DISTINCT too slow

From
Miroslav Šulc
Date:
Hello,

I have a table with cca 400,000 rows. The table contains column "key" of varchar(20) type containing 10 distinct values. I want to get out what distinct values are present in the column. I use this simple query, which is very slow:

SELECT DISTINCT Key FROM MRTPContactValue

Here is the query plan:

QUERY PLAN
Unique  (cost=64882.26..66964.59 rows=9 width=9) (actual time=26139.972..29593.164 rows=10 loops=1)
  ->  Sort  (cost=64882.26..65923.43 rows=416466 width=9) (actual time=26139.964..27975.944 rows=416466 loops=1)
        Sort Key: "key"
        ->  Seq Scan on mrtpcontactvalue  (cost=0.00..8669.66 rows=416466 width=9) (actual time=0.026..2460.535 rows=416466 loops=1)
Total runtime: 29603.159 ms

I've tried index on the "key" column but no improvement.

Is there a way to speed the SELECT up?

Thank you for any suggestions.
-- 
Miroslav Šulc
Attachment

Re: SELECT DISTINCT too slow

From
Alvaro Herrera
Date:
Miroslav ?ulc wrote:
> Hello,
> 
> I have a table with cca 400,000 rows. The table contains column "key" of
> varchar(20) type containing 10 distinct values. I want to get out what
> distinct values are present in the column. I use this simple query,
> which is very slow:
> 
> SELECT DISTINCT Key FROM MRTPContactValue

You could get the universe of values from the table where this is a
primary key, and use an IN clause (which apparently is more efficient
than an EXISTS in some cases, but try that too) to search for values
that exist in the MRTPContactValue table.

I assume you do have the other table, don't you?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: SELECT DISTINCT too slow

From
Miroslav Šulc
Date:
Well, "key" is not primary key from another table. It is just a column in pair "key" => "value".
The structure of the table is this:

Id (primary key)
MRTPContactId (id of contact from table MRTPContact)
Key (key from pair key => value)
Value (value from pair key => value)

So I want the get the list of keys used in the table.

Miroslav Šulc


Alvaro Herrera napsal(a):
Miroslav ?ulc wrote: 
Hello,

I have a table with cca 400,000 rows. The table contains column "key" of
varchar(20) type containing 10 distinct values. I want to get out what
distinct values are present in the column. I use this simple query,
which is very slow:

SELECT DISTINCT Key FROM MRTPContactValue   
You could get the universe of values from the table where this is a
primary key, and use an IN clause (which apparently is more efficient
than an EXISTS in some cases, but try that too) to search for values
that exist in the MRTPContactValue table.

I assume you do have the other table, don't you?
 
Attachment

Re: SELECT DISTINCT too slow

From
Alvaro Herrera
Date:
Miroslav ?ulc wrote:
> Well, "key" is not primary key from another table. It is just a column
> in pair "key" => "value".
> The structure of the table is this:
> 
> Id (primary key)
> MRTPContactId (id of contact from table MRTPContact)
> Key (key from pair key => value)
> Value (value from pair key => value)
> 
> So I want the get the list of keys used in the table.

The plan you get is the most efficient possible for that query.  If you
had a table of possible keys (which should of course be FK of "Key"),
you could get a much faster version :-)

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: SELECT DISTINCT too slow

From
Miroslav Šulc
Date:
It might be a good solution :-)

Thank you for your help.

Miroslav Šulc


Alvaro Herrera napsal(a):
> Miroslav ?ulc wrote:
>
>> Well, "key" is not primary key from another table. It is just a column
>> in pair "key" => "value".
>> The structure of the table is this:
>>
>> Id (primary key)
>> MRTPContactId (id of contact from table MRTPContact)
>> Key (key from pair key => value)
>> Value (value from pair key => value)
>>
>> So I want the get the list of keys used in the table.
>>
>
> The plan you get is the most efficient possible for that query.  If you
> had a table of possible keys (which should of course be FK of "Key"),
> you could get a much faster version :-)
>
>

Attachment

Re: SELECT DISTINCT too slow

From
Tom Lane
Date:
Miroslav Šulc <miroslav.sulc@startnet.cz> writes:
> I have a table with cca 400,000 rows. The table contains column "key" of
> varchar(20) type containing 10 distinct values. I want to get out what
> distinct values are present in the column. I use this simple query,
> which is very slow:

> SELECT DISTINCT Key FROM MRTPContactValue

TrySELECT Key FROM MRTPContactValue GROUP BY Key

The "select distinct" code is a bit old and crufty, GROUP BY is usually
smarter.
        regards, tom lane


Re: SELECT DISTINCT too slow

From
Miroslav Šulc
Date:
The GROUP BY is really fast :-)

Thank you.

Miroslav Šulc


Tom Lane napsal(a):
> Try
>     SELECT Key FROM MRTPContactValue GROUP BY Key
>
> The "select distinct" code is a bit old and crufty, GROUP BY is usually
> smarter.
>
>             regards, tom lane
>

Attachment

Re: SELECT DISTINCT too slow

From
Alvaro Herrera
Date:
Miroslav ?ulc wrote:
> The GROUP BY is really fast :-)

Doh!  How does it do it?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: SELECT DISTINCT too slow

From
Greg Stark
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:

> Miroslav ?ulc wrote:
> > Well, "key" is not primary key from another table. It is just a column
> > in pair "key" => "value".
> > The structure of the table is this:
> > 
> > Id (primary key)
> > MRTPContactId (id of contact from table MRTPContact)
> > Key (key from pair key => value)
> > Value (value from pair key => value)
> > 
> > So I want the get the list of keys used in the table.
> 
> The plan you get is the most efficient possible for that query.  If you
> had a table of possible keys (which should of course be FK of "Key"),
> you could get a much faster version :-)

Actually you could try the equivalent query:
SELECT Key FROM MRTPContactValue GROUP BY Key

This may or may not be faster because it can use a hash aggregate plan. I
would expect it to be faster here because there are few distinct keys and the
planner predicts that. 

Eventually these two queries should be handled the same by Postgres but Hash
Aggregates are a new addition and DISTINCT/DISTINCT ON hasn't been adapted to
make use of them.

Also, incidentally, I don't see how a table of possible keys could help you
here. Nothing forces they table MRTPContactValue to use all possible keys...

-- 
greg



Re: SELECT DISTINCT too slow

From
Miroslav Šulc
Date:
Greg Stark napsal(a):
Actually you could try the equivalent query:
SELECT Key FROM MRTPContactValue GROUP BY Key

This may or may not be faster because it can use a hash aggregate plan. I
would expect it to be faster here because there are few distinct keys and the
planner predicts that. 

Eventually these two queries should be handled the same by Postgres but Hash
Aggregates are a new addition and DISTINCT/DISTINCT ON hasn't been adapted to
make use of them.

Also, incidentally, I don't see how a table of possible keys could help you
here. Nothing forces they table MRTPContactValue to use all possible keys... 

I simpified the case because it was slow by itself. GROUP BY really makes this a lot faster.


The table contains properties for each contact that I cannot control how many properties and what names of the properties there will be. In my scenario user can export the data through user interface and I need to know what keys are used there to create appropriate column names. There is even one constraint. The contacts are grouped into groups so I need to get only the keys from a selected group. The real query is this (which is not so fast as the plain SELECT ... GROUP BY ... because the other table is also large enough) but now it is faster than before:

SELECT Key FROM MRTPContactValue
INNER JOIN MRTPContact
ON MRTPContactValue.MRTPContactId = MRTPContact.Id
WHERE MRTPContact.MRTPWaveQuestionnaireId = 1
GROUP BY Key


Here's the query plan:

QUERY PLAN
HashAggregate  (cost=32639.67..32639.76 rows=9 width=9) (actual time=19407.116..19407.146 rows=10 loops=1)
  ->  Hash Join  (cost=8070.36..31598.51 rows=416466 width=9) (actual time=5917.367..17607.502 rows=416466 loops=1)
        Hash Cond: ("outer".mrtpcontactid = "inner".id)
        ->  Seq Scan on mrtpcontactvalue  (cost=0.00..8669.66 rows=416466 width=17) (actual time=9.094..4233.131 rows=416466 loops=1)
        ->  Hash  (cost=7119.80..7119.80 rows=137824 width=8) (actual time=5096.750..5096.750 rows=137824 loops=1)
              ->  Seq Scan on mrtpcontact  (cost=0.00..7119.80 rows=137824 width=8) (actual time=9.312..4337.647 rows=137824 loops=1)
                    Filter: (mrtpwavequestionnaireid = 1)
Total runtime: 19417.873 ms


The same query using DISTINCT takes about 40 sec to complete.


Thank you.

--
Miroslav Šulc
Attachment

Re: SELECT DISTINCT too slow

From
Florian Weimer
Date:
* Alvaro Herrera:

> Miroslav ?ulc wrote:
>> The GROUP BY is really fast :-)
>
> Doh!  How does it do it?

It uses a hash table and can therefore discard duplicate rows more
quickly (essentially linear time in the number of rows if the number
of different rows is bounded).