Re: Finding latest record for a number of groups in an INSERT-only table

From: Gavin Flower
Subject: Re: Finding latest record for a number of groups in an INSERT-only table
Date:
Msg-id: 4E12DD48.70201@archidevsys.co.nz
In response to: Finding latest record for a number of groups in an INSERT-only table  (Daniel Farina)
List: pgsql-general


On 05/07/11 11:48, Daniel Farina wrote:
> This is basically exactly the same as
> http://archives.postgresql.org/pgsql-sql/2008-10/msg00009.php; I'm
> just asking again, to see if thinking on the problem has changed:
>
> The basic problem, restated, is one has a relation with tuples like this:
>
> (key, recency_stamp, other_columns*)
>
> Some example data may be:
>
> 'ice cream', 0, 'vanilla'
> 'ice cream', 1, 'chocolate'
> 'beer', 0, 'ale'
> 'beer', 3, 'lager'
>
> The goal is to extract from this relation the following:
>
> 'ice cream', 1, 'chocolate'
> 'beer', 3, 'lager'
>
> In my dataset, the distinct number of keys is << the number of rows;
> in fact, that ratio grows ever smaller until the data is truncated,
> with a convergence point probably resting at a negligible fraction of
> a percent -- definitely less than 0.1% in short order, easily getting
> to 0.01% and 0.001% in not a vast amount of time.  There is a standard
> btree index on (key, recency_stamp), which ideally is exploitable to
> compute the 'most recent' relation very quickly.  Approaches I've
> investigated include DISTINCT ON (this may be the cleanest), windowing
> functions (rank, and a predicate), and a GROUP BY with a self-join,
> but none of them avoid a full relation scan, which is just crippling
> performance-wise.
>
> The only way I could find to trick the optimizer into doing something
> non-crippling was to trick it with
> http://wiki.postgresql.org/wiki/Loose_indexscan to find distinct
> values quickly and subsequently using a subquery in the target list
> that the optimizer decides to run over every distinct value (it
> estimates a very high cost for this, even though it is actually rather
> fast, but the ridiculous costing can affect subsequent joins against the
> relation in unfortunate ways).  This mechanism is obscure enough that
> I may just write a plpgsql function as a workaround, as that may well
> be more lucid.
>
> I am using PostgreSQL 9.
>
> Does anyone have fresh thoughts or suggestions for dealing with
> INSERT-mostly tables conceived in this manner?
>
My solution, which is not especially optimised for an 'insert only'
table, is:

DROP TABLE IF EXISTS T;

CREATE TABLE T
(
     type    text,
     id      int,
     flavour text,
     PRIMARY KEY (type, id)
);

INSERT INTO T (type, id, flavour) VALUES
     ('ice cream', 0, 'vanilla'),
     ('ice cream', 1, 'chocolate'),
     ('beer', 0, 'ale'),
     ('beer', 3, 'lager');

WITH
     M (type, id) AS
     (
         SELECT
             type,
             max(id)
         FROM
             T
         GROUP BY
             type
     )
SELECT
     T.type,
     T.id,
     T.flavour
FROM
     M,
     T
WHERE
     T.type = M.type AND
     T.id = M.id
;
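
For comparison, the DISTINCT ON approach Daniel mentions can be written
against the same table like this (a sketch, not tuned against his data;
DISTINCT ON keeps the first row of each group under the ORDER BY):

```sql
-- DISTINCT ON keeps the first row per (type) group; ordering by
-- id DESC within each type makes that first row the latest one.
-- The ORDER BY must lead with the DISTINCT ON expression.
SELECT DISTINCT ON (type)
       type,
       id,
       flavour
FROM T
ORDER BY type, id DESC;
```

As Daniel notes, though, this still plans as a scan of the whole of T
followed by a sort or full index walk, not one probe per key.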
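
The loose index scan trick from the wiki page Daniel links can be combined
with his target-list subquery roughly as follows. The names keys, latest,
t2 and k are invented for this sketch, and it assumes the btree index on
(type, id) exists so that each step is a single index probe:

```sql
-- Recursive CTE that walks the distinct values of type one index
-- probe at a time (the "loose index scan" from the PostgreSQL wiki),
-- then fetches max(id) for each key with another probe, and finally
-- joins back to T to pick up the remaining columns.
WITH RECURSIVE keys(type) AS (
    SELECT min(type) FROM T
    UNION ALL
    SELECT (SELECT min(type) FROM T WHERE T.type > k.type)
    FROM keys k
    WHERE k.type IS NOT NULL
),
latest AS (
    SELECT k.type,
           (SELECT max(id) FROM T t2 WHERE t2.type = k.type) AS id
    FROM keys k
    WHERE k.type IS NOT NULL
)
SELECT T.type, T.id, T.flavour
FROM latest
JOIN T ON T.type = latest.type AND T.id = latest.id;
```

This does on the order of one index probe per distinct key rather than a
full relation scan, which is where the win should come from when the key
count is a tiny fraction of the row count.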

Cheers,
Gavin
