Thread: SELECT DISTINCT very slow
Hi,
Can anybody explain this:

Records: 600,000
Field Width: Approximately 20 UTF8 characters - type is VARCHAR(25)
Field is Indexed.

SELECT DISTINCT field FROM table;

Takes about 6 seconds. There are 111 distinct items.

On Sqlite, and another place where I have a B+Tree, this query is
faster than my eye can measure.

Is this a well known issue?

Thanks,
Ben
In response to Ben Harper <rogojin@gmail.com>:
> Hi,
> Can anybody explain this:
>
> Records: 600,000
> Field Width: Approximately 20 UTF8 characters - type is VARCHAR(25)
> Field is Indexed.
>
> SELECT DISTINCT field FROM table;
>
> Takes about 6 seconds. There are 111 distinct items.

What's the output of EXPLAIN ANALYZE SELECT DISTINCT field FROM table; ?

Does a VACUUM ANALYZE of the table help? Is the query significantly
faster the second time you run it?

> Is this a well known issue?

Not that I'm aware of.

--
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/
On Thursday 09 July 2009 17:09:13 Ben Harper wrote:
> Hi,
> Can anybody explain this:
>
> Records: 600,000
> Field Width: Approximately 20 UTF8 characters - type is VARCHAR(25)
> Field is Indexed.
>
> SELECT DISTINCT field FROM table;
>
> Takes about 6 seconds. There are 111 distinct items.
>
> On Sqlite, and another place where I have a B+Tree, this query is
> faster than my eye can measure.
>
> Is this a well known issue?

Yes, I think so. AFAIK the primary cause is that indexes in pg do not
store visibility information. That means you need to check for existence
of the tuple on the heap.

Possibly due to that, PG has no special-case code for DISTINCT to
optimize such a query using mostly the index. It would be possible, for
each possible value of 'field', to check the index only long enough to
prove that there is at least one such entry.

Taking that single field into its own table is not possible?

Andres
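For illustration, a minimal sketch of the "own table" idea Andres raises: normalize the repeated text into a small lookup table so that listing the distinct values becomes a scan of roughly 111 rows. All names here (field_values, main_table, field_id) are invented placeholders, not anything from the original post.

    -- Hypothetical sketch only; table and column names are assumptions.
    CREATE TABLE field_values (
        id    serial PRIMARY KEY,
        value varchar(25) NOT NULL UNIQUE
    );

    -- The main table references the lookup table instead of repeating the text.
    ALTER TABLE main_table ADD COLUMN field_id integer REFERENCES field_values(id);

    -- The distinct list is then just a read of the small lookup table:
    SELECT value FROM field_values ORDER BY value;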
Hello,

if you use a pg older than 8.3, please use GROUP BY:

SELECT field FROM table GROUP BY field;

Regards
Pavel Stehule

2009/7/9 Ben Harper <rogojin@gmail.com>:
> Hi,
> Can anybody explain this:
>
> Records: 600,000
> Field Width: Approximately 20 UTF8 characters - type is VARCHAR(25)
> Field is Indexed.
>
> SELECT DISTINCT field FROM table;
>
> Takes about 6 seconds. There are 111 distinct items.
>
> On Sqlite, and another place where I have a B+Tree, this query is
> faster than my eye can measure.
>
> Is this a well known issue?
>
> Thanks,
> Ben
On Thu, Jul 9, 2009 at 4:47 PM, Andres Freund <andres@anarazel.de> wrote:
> AFAIK the primary cause is that indexes in pg do not store visibility
> information.

Not really. The OP doesn't say how wide the record rows are, but unless
they're very wide it wouldn't pay to use an index for this even if you
didn't have to access the heap also. It's going to be faster to scan the
whole heap and either sort or use a hash. Currently there aren't many
cases where a btree with 6,000 copies of 111 distinct keys is going to
be useful.

Arguably the missing feature here is skip-scans, where we scan the index
but only pull out one record for each distinct value. I'm not sure
there's anything particularly stopping Postgres from being able to do
them, but it might be a lot of code for a narrow use case.

--
greg
http://mit.edu/~gsstark/resume.pdf
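As a side note, the access pattern Greg describes can be approximated from plain SQL with a recursive CTE that repeatedly asks the index for the next value greater than the last one found (a "loose index scan"). This is only a sketch: it needs 8.4's WITH RECURSIVE, and "tbl" and "field" stand in for the original poster's table and indexed column.

    -- Sketch of emulating a skip-scan; placeholder names, 8.4+ only.
    WITH RECURSIVE skip AS (
        (SELECT field FROM tbl ORDER BY field LIMIT 1)
        UNION ALL
        SELECT (SELECT field FROM tbl
                WHERE field > skip.field
                ORDER BY field LIMIT 1)
        FROM skip
        WHERE skip.field IS NOT NULL
    )
    SELECT field FROM skip WHERE field IS NOT NULL;

With an index on field, each step is a single index probe, so the whole query touches on the order of 111 rows instead of 600,000.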
Greg Stark <gsstark@mit.edu> writes:
> Not really. The OP doesn't say how wide the record rows are but unless
> they're very wide it wouldn't pay to use an index for this even if you
> didn't have to access the heap also. It's going to be faster to scan
> the whole heap and either sort or use a hash. Currently there aren't
> many cases where a btree with 6,000 copies of 111 distinct keys is
> going to be useful.

It was 600,000 not 6,000 ... so a skip-scan might be worth the trouble,
but as you say we haven't done it.

In any case I think the real issue is that the OP is probably using a
pre-8.4 release, which will always do SELECT DISTINCT via
sort-and-unique. Hash aggregation would be a whole lot faster for these
numbers, even if not exactly instantaneous. He could update to 8.4, or
go over to using GROUP BY as was recommended upthread.

			regards, tom lane
Thanks for all the feedback. Using GROUP BY is indeed much faster
(about 1 second).

Unfortunately I can't use GROUP BY, because what I'm really doing is

SELECT DISTINCT ON (unique_field) id FROM table;

I'm not familiar with the Postgres internals, but in my own DB system
that I have written, I do the skip-scanning thing, and for my system it
was a really trivial optimization to code. I know, I'm always free to
submit a patch, and hopefully someday I will, if it hasn't already been
done by then.

I can't comment on whether this skip-scan optimization is general enough
to warrant the lines of code, but I might as well explain my use case:
inside a GIS application, the user wants to categorize the display of
some information based on, in this case, the suburb name. He clicks a
button that says "Add All Unique Categories". This is a very common
operation in this domain.

Again, thanks for all the feedback. I'll upgrade to 8.4 soon.

Ben Harper

On Fri, Jul 10, 2009 at 2:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <gsstark@mit.edu> writes:
>> Not really. The OP doesn't say how wide the record rows are but unless
>> they're very wide it wouldn't pay to use an index for this even if you
>> didn't have to access the heap also. It's going to be faster to scan
>> the whole heap and either sort or use a hash. Currently there aren't
>> many cases where a btree with 6,000 copies of 111 distinct keys is
>> going to be useful.
>
> It was 600,000 not 6,000 ... so a skip-scan might be worth the trouble,
> but as you say we haven't done it.
>
> In any case I think the real issue is that the OP is probably using a
> pre-8.4 release which will always do SELECT DISTINCT via sort-and-unique.
> Hash aggregation would be a whole lot faster for these numbers, even
> if not exactly instantaneous. He could update to 8.4, or go over to
> using GROUP BY as was recommended upthread.
>
>                        regards, tom lane
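For reference, the deterministic form of the DISTINCT ON query spells out an ORDER BY that starts with the DISTINCT ON expression; the next sort key decides which row's id is kept per group. This is just a sketch using the thread's placeholder names ("tbl", "unique_field", "id"), and without such an ORDER BY the returned id per group is unspecified, which is the point raised in the next reply.

    -- Sketch: DISTINCT ON with an explicit ordering; placeholder names.
    SELECT DISTINCT ON (unique_field) id
    FROM tbl
    ORDER BY unique_field, id;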
On Fri, Jul 10, 2009 at 1:41 PM, Ben Harper <rogojin@gmail.com> wrote:
> Unfortunately I can't use GROUP BY, because what I'm really doing is
> SELECT DISTINCT ON(unique_field) id FROM table;

You could do that using GROUP BY if you define a first() aggregate. In
this case that would just be

SELECT first(id) AS id
FROM (SELECT * FROM table ORDER BY unique_field, ...) AS sorted
GROUP BY unique_field;

In cases with more fields it gets tiresome fast.

In this case 8.4 won't actually help you. It only uses hash aggregates
for DISTINCT, not DISTINCT ON.

> I'm not familiar with the Postgres internals, but in my own DB system
> that I have written, I do the skip-scanning thing, and for my system
> it was a really trivial optimization to code.

Well, things get tricky quickly when you have to deal with concurrent
inserts and potentially page splits from other transactions. Also
consider how hard it is to prove that the query falls into this category
of queries.

> Inside a GIS application, the user wants to categorize the display of
> some information based on, in this case, the suburb name.
> He clicks a button that says "Add All Unique Categories". This is a
> very common operation in this domain.

That doesn't look like what this query is doing to me. It's taking one
exemplar from each suburb based on some other constraint (the minimum of
whatever your order by key is) and taking the id of that data point. If
the order by key doesn't specify a specific data point then it's a
non-deterministic record.

--
greg
http://mit.edu/~gsstark/resume.pdf
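Note that first() is not a built-in aggregate, so it has to be defined before Greg's rewrite will run. A minimal sketch along the lines of the commonly used "first value wins" pattern might look like this; the function and aggregate names (first_agg, first) and the table name "tbl" are assumptions, and the approach relies on the subquery's ordering reaching the aggregate, as Greg's suggestion assumes.

    -- Sketch of a first() aggregate; names are assumptions.
    CREATE OR REPLACE FUNCTION first_agg(anyelement, anyelement)
    RETURNS anyelement
    LANGUAGE sql IMMUTABLE AS 'SELECT $1';

    CREATE AGGREGATE first (anyelement) (
        SFUNC = first_agg,
        STYPE = anyelement
    );

    -- Then, roughly as described above ("tbl" and the commented-out ", ..."
    -- stand in for the real table and any further ordering keys):
    SELECT first(id) AS id
    FROM (SELECT * FROM tbl ORDER BY unique_field /* , ... */) AS sorted
    GROUP BY unique_field;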
On Fri, 2009-07-10 at 01:36 +0100, Greg Stark wrote:
> Arguably the missing feature here is skip-scans where we scan the
> index but only pull out one record for each distinct value. I'm not
> sure there's anything particularly stopping Postgres from being able
> to do them, but it might be a lot of code for a narrow use case.

Hypothetically, would something like a "sort distinct" operator be of
any use? I wonder how much work it would save if the sort could save
steps by weeding out duplicate tuples while sorting. That might make
sort into a better plan in cases where we don't have a good estimate of
the number of distinct values.

Regards,
	Jeff Davis