pgsql-server/contrib/tablefunc README.tablefun ... - Mailing list pgsql-committers

From momjian@postgresql.org (Bruce Momjian - CVS)
Subject pgsql-server/contrib/tablefunc README.tablefun ...
Date
Msg-id 20030320064630.DA4A3475C15@postgresql.org
Whole thread Raw
List pgsql-committers
CVSROOT:    /cvsroot
Module name:    pgsql-server
Changes by:    momjian@postgresql.org    03/03/20 01:46:30

Modified files:
    contrib/tablefunc: README.tablefunc tablefunc.c tablefunc.h
                       tablefunc.sql.in
    contrib/tablefunc/expected: tablefunc.out
    contrib/tablefunc/sql: tablefunc.sql

Log message:
    Attached is an update to contrib/tablefunc. It implements a new hashed
    version of crosstab. This fixes a major deficiency in real-world use of
    the original version. Easiest to undestand with an illustration:

    Data:
    -------------------------------------------------------------------
    select * from cth;
    id | rowid |        rowdt        |   attribute    |      val
    ----+-------+---------------------+----------------+---------------
    1 | test1 | 2003-03-01 00:00:00 | temperature    | 42
    2 | test1 | 2003-03-01 00:00:00 | test_result    | PASS
    3 | test1 | 2003-03-01 00:00:00 | volts          | 2.6987
    4 | test2 | 2003-03-02 00:00:00 | temperature    | 53
    5 | test2 | 2003-03-02 00:00:00 | test_result    | FAIL
    6 | test2 | 2003-03-02 00:00:00 | test_startdate | 01 March 2003
    7 | test2 | 2003-03-02 00:00:00 | volts          | 3.1234
    (7 rows)

    Original crosstab:
    -------------------------------------------------------------------
    SELECT * FROM crosstab(
    'SELECT rowid, attribute, val FROM cth ORDER BY 1,2',4)
    AS c(rowid text, temperature text, test_result text, test_startdate
    text, volts text);
    rowid | temperature | test_result | test_startdate | volts
    -------+-------------+-------------+----------------+--------
    test1 | 42          | PASS        | 2.6987         |
    test2 | 53          | FAIL        | 01 March 2003  | 3.1234
    (2 rows)

    Hashed crosstab:
    -------------------------------------------------------------------
    SELECT * FROM crosstab(
    'SELECT rowid, attribute, val FROM cth ORDER BY 1',
    'SELECT DISTINCT attribute FROM cth ORDER BY 1')
    AS c(rowid text, temperature int4, test_result text, test_startdate
    timestamp, volts float8);
    rowid | temperature | test_result |   test_startdate    | volts
    -------+-------------+-------------+---------------------+--------
    test1 |          42 | PASS        |                     | 2.6987
    test2 |          53 | FAIL        | 2003-03-01 00:00:00 | 3.1234
    (2 rows)

    Notice that the original crosstab slides data over to the left in the
    result tuple when it encounters missing data. In order to work around
    this you have to be make your source sql do all sorts of contortions
    (cartesian join of distinct rowid with distinct attribute; left join
    that back to the real source data). The new version avoids this by
    building a hash table using a second distinct attribute query.

    The new version also allows for "extra" columns (see the README) and
    allows the result columns to be coerced into differing datatypes if they
    are suitable (as shown above).

    In testing a "real-world" data set (69 distinct rowid's, 27 distinct
    categories/attributes, multiple missing data points) I saw about a
    5-fold improvement in execution time (from about 2200 ms old, to 440 ms
    new).

    I left the original version intact because: 1) BC, 2) it is probably
    slightly faster if you know that you have no missing attributes.

    README and regression test adjustments included. If there are no
    objections, please apply.

    Joe Conway


pgsql-committers by date:

Previous
From: momjian@postgresql.org (Bruce Momjian - CVS)
Date:
Subject: pgsql-server/src/bin/psql command.c common.c c ...
Next
From: momjian@postgresql.org (Bruce Momjian - CVS)
Date:
Subject: pgsql-server/doc TODO