Re: Best way to get all different values in a column - Mailing list pgsql-performance

From Steve Atkins
Subject Re: Best way to get all different values in a column
Date
Msg-id 20051014170424.GA25578@gp.word-to-the-wise.com
Whole thread Raw
In response to Best way to get all different values in a column  (<ext-christian.roche@nokia.com>)
List pgsql-performance
On Fri, Oct 14, 2005 at 06:02:56PM +0200, ext-christian.roche@nokia.com wrote:
> Ok, since my question got no answer on the general list, I'm reposting
> it here since this list seems in fact better suited to it.
>
> Does anyone here know what  is  the most efficient way to  list all
> different values of a given column with low cardinality ?  For instance
> I have a table with columns DAY, NAME, ID, etc.  The table is updated
> about each week with thousands of records with the same (current) date.
> Now I would like to list all values for DAY, only if possible without
> scanning all the table each time I submit the request.
>
> I can think of:
>
> Solution 1: SELECT DAY FROM TABLE GROUP BY DAY;
>
> Solution 2: SELECT DISTINCT DAY FROM TABLE;
>
> (BTW why do those two yield such different performances, the later being
> seemingly *much* slower than the former  ?)
>
> Solution 3: Improve performance through an index scan by using DAY as
> the first element of the PK,  (PRIMARY KEY (DAY, ID) ), although DAY has
> a low cardinality ?
>
> Solution 4: Create a separate index on column DAY ?
>
> Solution 5: Use some kind of view / stored procedure that would be
> precomputed when TABLE is updated or cached when called for the first
> time ? Does something like that exist ?
>
> Solution 6: Store the values in a separate table, recreated each time
> TABLE is updated.
>
> This looks to me as a very common problem. Is there an obvious / best /
> standard solution there ? What would be the expected performance of the
> different solutions above ? (I guess some are probably non-sense)
>

There's not going to be a single "best" solution, as it'll depend on
your requirements, and on your application level constraints.

You say that the table is seldom updated (a few thousand a week is "almost
never"). If it's updated in a single batch you could simply generate
a table of the distinct values after each update pretty easily (solution
6).

If you don't have such a well-defined update then using a trigger on
inserts, updates and deletes of the table to update a separate table
to keep track of the counts of each distinct values, then you can
just select any row with a non-zero count from that table (solution 5).
(You need the counts to be able to deal with deletes efficiently). That
would increase the cost of updating the main table significantly, but
you're putting very little traffic through it, so that's unlikely to
be a problem.

I doubt that solutions 3 or 4 are worth looking at at all, and the first
two are what they are and you know their performance already.

You could probably do this far more efficiently with some of the work
being done in the application layer, rather than in the database - for
instance you could update the counts table one time per transaction,
rather than one time per operation - but that would lose you the
convenience of maintaining the counts correctly when you futz with
the data manually or using tools not aware of the count table.

Cheers,
  Steve

pgsql-performance by date:

Previous
From: mark@mark.mielke.cc
Date:
Subject: Re: Best way to get all different values in a column
Next
From: Martin Nickel
Date:
Subject: Sequential scan on FK join