Re: Best way to get all different values in a column - Mailing list pgsql-performance
From | Steve Atkins |
---|---|
Subject | Re: Best way to get all different values in a column |
Date | |
Msg-id | 20051014170424.GA25578@gp.word-to-the-wise.com Whole thread Raw |
In response to | Best way to get all different values in a column (<ext-christian.roche@nokia.com>) |
List | pgsql-performance |
On Fri, Oct 14, 2005 at 06:02:56PM +0200, ext-christian.roche@nokia.com wrote: > Ok, since my question got no answer on the general list, I'm reposting > it here since this list seems in fact better suited to it. > > Does anyone here know what is the most efficient way to list all > different values of a given column with low cardinality ? For instance > I have a table with columns DAY, NAME, ID, etc. The table is updated > about each week with thousands of records with the same (current) date. > Now I would like to list all values for DAY, only if possible without > scanning all the table each time I submit the request. > > I can think of: > > Solution 1: SELECT DAY FROM TABLE GROUP BY DAY; > > Solution 2: SELECT DISTINCT DAY FROM TABLE; > > (BTW why do those two yield such different performances, the later being > seemingly *much* slower than the former ?) > > Solution 3: Improve performance through an index scan by using DAY as > the first element of the PK, (PRIMARY KEY (DAY, ID) ), although DAY has > a low cardinality ? > > Solution 4: Create a separate index on column DAY ? > > Solution 5: Use some kind of view / stored procedure that would be > precomputed when TABLE is updated or cached when called for the first > time ? Does something like that exist ? > > Solution 6: Store the values in a separate table, recreated each time > TABLE is updated. > > This looks to me as a very common problem. Is there an obvious / best / > standard solution there ? What would be the expected performance of the > different solutions above ? (I guess some are probably non-sense) > There's not going to be a single "best" solution, as it'll depend on your requirements, and on your application level constraints. You say that the table is seldom updated (a few thousand a week is "almost never"). If it's updated in a single batch you could simply generate a table of the distinct values after each update pretty easily (solution 6). If you don't have such a well-defined update then using a trigger on inserts, updates and deletes of the table to update a separate table to keep track of the counts of each distinct values, then you can just select any row with a non-zero count from that table (solution 5). (You need the counts to be able to deal with deletes efficiently). That would increase the cost of updating the main table significantly, but you're putting very little traffic through it, so that's unlikely to be a problem. I doubt that solutions 3 or 4 are worth looking at at all, and the first two are what they are and you know their performance already. You could probably do this far more efficiently with some of the work being done in the application layer, rather than in the database - for instance you could update the counts table one time per transaction, rather than one time per operation - but that would lose you the convenience of maintaining the counts correctly when you futz with the data manually or using tools not aware of the count table. Cheers, Steve
pgsql-performance by date: