Re: SELECT DISTINCT too slow - Mailing list pgsql-sql

From Miroslav Šulc
Subject Re: SELECT DISTINCT too slow
Date
Msg-id 447F109D.5030400@startnet.cz
Whole thread Raw
In response to Re: SELECT DISTINCT too slow  (Greg Stark <gsstark@mit.edu>)
List pgsql-sql
Greg Stark napsal(a):
Actually you could try the equivalent query:
SELECT Key FROM MRTPContactValue GROUP BY Key

This may or may not be faster because it can use a hash aggregate plan. I
would expect it to be faster here because there are few distinct keys and the
planner predicts that. 

Eventually these two queries should be handled the same by Postgres but Hash
Aggregates are a new addition and DISTINCT/DISTINCT ON hasn't been adapted to
make use of them.

Also, incidentally, I don't see how a table of possible keys could help you
here. Nothing forces they table MRTPContactValue to use all possible keys... 

I simpified the case because it was slow by itself. GROUP BY really makes this a lot faster.


The table contains properties for each contact that I cannot control how many properties and what names of the properties there will be. In my scenario user can export the data through user interface and I need to know what keys are used there to create appropriate column names. There is even one constraint. The contacts are grouped into groups so I need to get only the keys from a selected group. The real query is this (which is not so fast as the plain SELECT ... GROUP BY ... because the other table is also large enough) but now it is faster than before:

SELECT Key FROM MRTPContactValue
INNER JOIN MRTPContact
ON MRTPContactValue.MRTPContactId = MRTPContact.Id
WHERE MRTPContact.MRTPWaveQuestionnaireId = 1
GROUP BY Key


Here's the query plan:

QUERY PLAN
HashAggregate  (cost=32639.67..32639.76 rows=9 width=9) (actual time=19407.116..19407.146 rows=10 loops=1)
  ->  Hash Join  (cost=8070.36..31598.51 rows=416466 width=9) (actual time=5917.367..17607.502 rows=416466 loops=1)
        Hash Cond: ("outer".mrtpcontactid = "inner".id)
        ->  Seq Scan on mrtpcontactvalue  (cost=0.00..8669.66 rows=416466 width=17) (actual time=9.094..4233.131 rows=416466 loops=1)
        ->  Hash  (cost=7119.80..7119.80 rows=137824 width=8) (actual time=5096.750..5096.750 rows=137824 loops=1)
              ->  Seq Scan on mrtpcontact  (cost=0.00..7119.80 rows=137824 width=8) (actual time=9.312..4337.647 rows=137824 loops=1)
                    Filter: (mrtpwavequestionnaireid = 1)
Total runtime: 19417.873 ms


The same query using DISTINCT takes about 40 sec to complete.


Thank you.

--
Miroslav Šulc
Attachment

pgsql-sql by date:

Previous
From: Greg Stark
Date:
Subject: Re: SELECT DISTINCT too slow
Next
From: postgres@jal.org
Date:
Subject: Re: Table design question