Home > mailing lists

Advice on efficient approach to aggregation statistics - Mailing list pgsql-general

From	postgresql.30.miller_2555@spamgourmet.com
Subject	Advice on efficient approach to aggregation statistics
Date	November 12, 2010 16:22:17
Msg-id	AANLkTi=1ORc6ivJz5nqAquOzX+qmWP3qw-7pO=nBrdSM@mail.gmail.com Whole thread Raw
List	pgsql-general

Tree view

Hi -

    I have a table for which I want to fit an exponential curve to values in one column of the table and compute a correlation statistic on the fit of the values to the exponential curve. The table consists of the following three columns: {"ID","Date","Value"}. The desired result of the exercise is to have a list of "ID"s, with the associated correlation statistic (based on the "Value" column) for a subset of "Date"s. The "fitted" exponential curve is not a true exponential regression for performance reasons. It uses the exponential form: y = a * exp(b * x), but selects the constant values 'a' & 'b' using only the first and last "Values" (as sorted by "Date") -- so it's really a best-fit exponential curve which is assured to pass through the first and last sorted points. The following is a simplified example (assuming the best practice solution involves the development of an aggregate function):

Table (primary key is {"ID","Date"}):
    ID |     Date | Value
    ----------------------------------
    A | 2010-11-12 | 48.0000
    A | 2010-11-11 | 24.0000
    A | 2010-11-10 | 12.0000
    A | 2010-11-09 | 6.0000
    A | 2010-11-08 | 3.0000
    A | 2010-11-07 | 1.5000
    B | 2010-11-12 | 33.1479
    B | 2010-11-11 | 19.5030
    B | 2010-11-10 | 11.4749
    B | 2010-11-09 | 6.7514
    B | 2010-11-08 | 3.9723
    B | 2010-11-07 | 2.3371

Hypothetical query:
    SELECT "ID", correl_exp("Value") AS "Corr"
    FROM Table
    WHERE "Date" >= '2010-11-09'
    GROUP BY "ID";

Expected output:
    ID | Corr
    ---------------
    A | 1.0000
    B | 0.8690

If attempting to reconstruct the simple example, the following are the values for constants 'a' and 'b' for "ID"s A & B (also note that values for dates 11/7 and 11/8 are ignored given the hypothetical query):
    A: a = 3.0000, b = 0.6931
    B: a = 1.4283, b = 0.9513

Questions:
    1) What is the best approach to accomplish the expected result?
        I would think the creation of an aggregate function would be the "best" option (i.e. highest performance with little ongoing database maintenence), but thought a nested SQL statement may also work.
    2) If the creation of an aggregate function is the best option,
        A) What is the highest performance language (or does it even matter)? PL/C, PL/R, PL/pgSQL, other? Can one use pre-complied shared libraries?
        B) What is the best state function used to construct an array of "Values", sorted by "Date" (as both the first and the last value need to be known to compute the exponential curve)? I have heard array_append has performance issues.

Many thanks -
Will

PostgreSQL v. 8.3.8 (expecting to upgrade to v.9.0.1 soon -- so best practice solution should be compatable with v9.0.1 -- if best solution takes advantage of improvements in v9.0.1 and is incompatable with v8.3.8, please note and I can implement the solution after upgrade)

pgsql-general by date:

From: "Mark Mitchell"
Date: 12 November 2010, 15:35:05
Subject: Re: More then 1600 columns?

From: "Clark C. Evans"
Date: 12 November 2010, 19:24:45
Subject: Re: More then 1600 columns?

Advice on efficient approach to aggregation statistics - Mailing list pgsql-general

Previous

Next