Re: table-valued arguments for functions - Mailing list pgsql-general

From Wappler, Robert
Subject Re: table-valued arguments for functions
Date
Msg-id C8E2DAF0E663A948840B04023E0DE32A0241E01D@w2k3server02.de.ophardt.com
Whole thread Raw
In response to Re:  (Filip Rembiałkowski <plk.zuber@gmail.com>)
Responses Re: table-valued arguments for functions
List pgsql-general
> -----Original Message-----
> From: Filip Rembiałkowski [mailto:plk.zuber@gmail.com]
> Sent: Monday, December 21, 2009 6:51 PM
> To: Wappler, Robert
> Cc: pgsql-general@postgresql.org
> Subject: Re: [GENERAL]
>
>
> 2009/12/21 Wappler, Robert <rwappler@ophardt.com>
>
>
>     Hello,
>
>     when defining functions, why can't they take table-valued arguments?
> I do not see the sense of this restriction. When a function takes a table
> as an argument, whole tables could be transformed in an obvious way.
> Currently functions must operate row-wise (at least I haven't found
> another way). This also limits the capabilities of joining the row-
> arguments, obviously this must be done before passing the arguments to the
> functions, instead of having one or more table-arguments and simply doing
> the joins within the function. Further, table valued arguments could lower
> the developer's work to prepare the arguments before the function call and
> enabling each functions to care for them itself by transforming them for
> the implementation instead of introducing several more or less anonymous
> record types just for passing values from one function to another which
> often also carry implementation specific constraints. The structure of the
> code as a set of functions would also greatly improve.
>
>
>
>
> SQL itself is a language which operates on data sets. Interesting, why you
> need a functional interface here?
>

Mainly for processing batches in a handy way.

> Could you give some example - what you want to achieve, and what's the
> missing element?
>  ( please add some DDL and code if possible ).
>
>
> As you know, every table in postgres has its rowtype definition.
>
> CREATE TABLE aaa ( ... );
> CREATE FUNCTION aaa_do (aaa) ...;
> SELECT aaa_do(aaa.*) from aaa;
>
> but that's the row-wise way.
>
> You can also operate on tables via their *names*:
> SELECT somefunc( 'aaa' ); /* some dynamic SQL inside */
>
> You will find some examples in contrib/tablefunc
>
>
>
>
>
>
>     What does the optimizer do with this per-row-semantic, even if it is
> invoked on each row? Does it transform the queries within the function to
> directly work on the table and realize possible advantages from indexes
> and other techniques?
>
> Yes that's what it does, in general.
> But, if processing is all "canned" inside user defined functions, the
> optimizer can't do it's job.
>
>

Assume we have a set of sensors, which regularly send some data. Depending on the frequency, the data is sent, the
performancefor insertion can become very important. So usually a very simple schema is used for the staging, e.g 

CREATE TABLE sensor_data (
    id SERIAL PRIMARY KEY,
    sensor_id int,
    time timestamp NOT NULL DEFAULT NOW(),
    temperature real,
    humidity real);

The sensor id alone does not carry enough information for creating useful reports, other tables may provide location
information.

CREATE TABLE sensor_location (
    sensor_id int PRIMARY KEY,
    location text NOT NULL);

A periodic job may now take a bulk of rows from the sensor_data table and insert them incrementally into tables used to
createreports. 

CREATE TABLE temperature_per_location (
    location text PRIMARY KEY,
    avg_temperature real,
    no_of_measurements int);

The row-wise way to update temperature_per_location could be

CREATE OR REPLACE FUNCTION update_tpl(measurement sensor_data)
RETURNS void VOLATILE AS $$
UPDATE temperature_per_location AS tpl2
SET (avg_temperature, no_of_measurements) =
    (SELECT (tpl.no_of_measurements * tpl.avg_temperature
        + $1.temperature) / (tpl.no_of_mesaurements + 1))
            AS avg_temperature,
        tpl.no_of_measurements + 1 AS no_of_measurements
    FROM sensor_location l JOIN temperature_per_location tpl USING
        (location)
    WHERE $1.sensor_id = l.sensor_id)
FROM sensor_location l2
WHERE $1.sensor_id = l2.sensor_id
    AND l2.location = tpl2.location;
$$ LANGUAGE SQL;

Obviously the above function must be called from inside a loop or within a SELECT-list to process a whole bulk of rows.
Atleast inside a loop, a cannot see, how an optimizer could make use of indexes, which may exist on sensor_location.
Theactual order of the updates will depend on the order the rows are passed in from table sensor_data. The call has to
beprepared e.g. by 

DECLARE
    batch SETOF sensor_data;
BEGIN
    SELECT * FROM select_batch() INTO batch;

    FOR b IN batch
        update_tpl(b);
    END FOR;
END;

Or a bit shorter but still verbose
SELECT update_tpl(id, sensor_id, time, temperature, humidity)
FROM select_batch();

Assuming we could pass a bulk of rows as a table, the update could be performed as follows:

CREATE OR REPLACE FUNCTION update_tpl(measurements SETOF sensor_data) RETURNS void VOLATILE AS $$
UPDATE temperature_per_location AS tpl
SET (avg_temperature, no_of_measurements) =
    ((tpl.no_of_measurements * tpl.temperature  +
        m.temperature)/(tpl.no_of_measurements
        + COUNT(*) OVER(PARTITION BY l.location)),     tpl.no_of_measurements + COUNT(*) OVER (PARTITION BY
l.location))
FROM measurements m JOIN sensor_location l USING (sensor_id)
WHERE tpl.location = l.location;
$$ LANGUAGE SQL;

The second version wouldn't need an enclosing loop, it could also use directly a combined index on sensor_location
(sensor_id,location) to 
optimize the iteration performed in the update. The code for the invocation may be

SELECT * FROM update_tpl(select_batch());

This example is synthetic and quite simple, obviously it lacks removal of rows from the staging table as well as an
initialinsertion before performing the update in the temperature_per_location table. But I think it illustrates the
featureof table-valued arguments. 

Another point where I met the wish perform operations on table-valued arguments is the insertion of missing values,
whichoften examines dependencies between different rows. For now this means to determine these rows before calling the
functionand passing the arguments to it or to make a set of incomplete data explicit by inserting it in an auxiliary
tablebefore continuing the processing or doing it within one function using variables and/or sub-queries. I'm not a fan
ofdoing everything in one lengthy function. Functions are also a tool to structure the code, e.g. into select_batch()
returningSETOF sometype, add_location_information(SETOF sometype) returning SETOF other_type, add_missing_values(SETOF
other_type)returning other type and finally update_tables(SETOF other_type) leading to easily changeable
implementationsof the different tasks. 


pgsql-general by date:

Previous
From: "A. Kretschmer"
Date:
Subject: Re: Not finding RPMs for 8.4.2!
Next
From: Filip Rembiałkowski
Date:
Subject: Re: Drop/ Alter index if exist