Thread: Adding pipelining support to set returning functions

Adding pipelining support to set returning functions

From

Hannu Krosing

Date:

06 April 2008, 04:02:17

A question to all pg hackers

Is anybody working on adding pipelining to set returning functions?

How much effort would it take ?

Where should I start digging ?

BACKGROUND:

AFAICS, currently set returning functions materialise their results
before returning, as seen by this simple test:

hannu=# select * from generate_series(1,10) limit 2;
 generate_series
-----------------
               1
               2
(2 rows)

Time: 1.183 ms


hannu=# select * from generate_series(1,10000000) limit 2;
 generate_series
-----------------
               1
               2
(2 rows)

Time: 3795.032 ms

Being able to pipeline (generate results as needed) would enable several
interesting techniques, especially if combined with pl/proxy or any
other functions which stream external data.
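
The difference between the two behaviours can be mimicked outside the
database. The following is only a Python sketch, not PostgreSQL code:
series_materialized, series_pipelined and limit are hypothetical names,
and the calls counter exists only to show how many rows each variant
produces before a LIMIT-like consumer gets its two rows.

```python
from itertools import islice

# Counts how many rows each variant actually generates.
calls = {"materialized": 0, "pipelined": 0}

def series_materialized(start, stop):
    """Build the whole result before returning it (hypothetical stand-in)."""
    rows = []
    for i in range(start, stop + 1):
        calls["materialized"] += 1
        rows.append(i)
    return rows

def series_pipelined(start, stop):
    """Yield one row at a time, only when the consumer asks for it."""
    for i in range(start, stop + 1):
        calls["pipelined"] += 1
        yield i

def limit(rows, n):
    """Rough analogue of LIMIT n: stop pulling after n rows."""
    return list(islice(rows, n))

first_m = limit(series_materialized(1, 100000), 2)
first_p = limit(series_pipelined(1, 100000), 2)
```

Both calls return the same two rows, but the materialized variant
generates all 100000 values first while the pipelined one generates
only the two that are consumed.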

Applications and design patterns like http://telegraph.cs.berkeley.edu/
or http://labs.google.com/papers/mapreduce.html would suddenly become
very easy to implement.

-----------------
Hannu

Re: Adding pipelining support to set returning functions

From

David Fetter

Date:

06 April 2008, 18:17:17

On Sun, Apr 06, 2008 at 10:01:20AM +0300, Hannu Krosing wrote:
> A question to all pg hackers
> 
> Is anybody working on adding pipelining to set returning functions.
> 
> How much effort would it take ?
> 
> Where should I start digging ?
> 
> BACKGROUND:
> 
> AFAICS , currently set returning functions materialise their results
> before returning, as seen by this simple test:

+1 for doing this. :)

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Re: Adding pipelining support to set returning functions

From

Joe Conway

Date:

06 April 2008, 18:32:51

Hannu Krosing wrote:
> A question to all pg hackers
> 
> Is anybody working on adding pipelining to set returning functions.

Not as far as I know.

> How much effort would it take ?
> 
> Where should I start digging ?

I don't remember all the details, but I think the original SRF patch 
that I did was pipelined rather than materialized, and there were 
difficult issues that led to going with the materialized version 
"first", with a pipelined version to "eventually" follow. 
SFRM_Materialize is for the former and SFRM_ValuePerCall was supposed to 
be for the latter. The first thread was this one:
http://archives.postgresql.org/pgsql-hackers/2002-04/msg01201.php

Joe

Re: Adding pipelining support to set returning functions

From

Hannu Krosing

Date:

11 April 2008, 07:03:14

On Fri, 2008-04-11 at 10:57 +0200, Hans-Juergen Schoenig wrote:
> Hannu Krosing wrote:
> > A question to all pg hackers
> >
> > Is anybody working on adding pipelining to set returning functions.
> >
> > How much effort would it take ?
> >
> > Where should I start digging ?
> >   
> 
> i asked myself basically the same question some time ago.
> pipelining seems fairly impossible unless we ban joins on those 
> "plugins" completely.

Not really; they just have to be "materialized" before joins, or the
streaming node has to be on the driving side of the join, so you can
fetch one tuple and then join it via an index or hash lookup.
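
A rough Python sketch of the second option, with the streaming source on
the driving side and a hash lookup on the other side. This is not
executor code; hash_join, events and users are hypothetical names and
the rows are made-up sample data.

```python
def hash_join(stream, inner_rows, stream_key, inner_key):
    """Join a one-pass stream against a fully loaded inner side."""
    # Build phase: the inner side is read once into a hash table.
    table = {}
    for row in inner_rows:
        table.setdefault(row[inner_key], []).append(row)
    # Probe phase: each streamed row is fetched once and never revisited.
    for row in stream:
        for match in table.get(row[stream_key], []):
            yield {**match, **row}

def events():
    """Hypothetical streaming source; readable only once, front to back."""
    yield {"user_id": 1, "event": "login"}
    yield {"user_id": 3, "event": "click"}
    yield {"user_id": 1, "event": "logout"}

users = [{"user_id": 1, "name": "ann"}, {"user_id": 2, "name": "bob"}]
joined = list(hash_join(events(), users, "user_id", "user_id"))
```

The stream is never rewound, so it does not need to be materialized;
only the inner side is held in memory.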

> i think this should be fine for your case (no need to join PL/proxy 
> partitions) - what we want here is to re-unify data and sent it through 
> centralized BI.

In the PL/Proxy context I was aiming at sorting data at the nodes and then
being able to merge several partitions while preserving order, and doing
this without having to store N*partition_size rows in the result set.
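
As an illustration of that merge, here is a small Python sketch using
the standard library's heapq.merge, which lazily merges already sorted
inputs. The partition function is a hypothetical stand-in for one node
returning its rows in sorted order; the numbers are made-up data.

```python
import heapq

def partition(rows):
    """Hypothetical stand-in for one node streaming rows in sorted order."""
    for row in rows:
        yield row

def merged(*partitions):
    """Lazily merge sorted streams, buffering about one row per partition."""
    return heapq.merge(*partitions)

result = list(merged(partition([1, 4, 9]),
                     partition([2, 3, 10]),
                     partition([5, 6])))
```

Only the current head row of each partition is held at any time, rather
than every row of every partition.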

> >
...
> >   
> 
> currently things like nodeSeqscan do SeqNext and so on - one records is 
> passed on to the next level.
> why not have a nodePlugin or so doing the same?
> or maybe some additional calling convention for streaming functions...
> 
> e.g.:
> CREATE STREAMING FUNCTION xy() RETURNS NEXT RECORD AS $$
>     return exactly one record to keep doing
>     return NULL to mark "end of table"
> $$ LANGUAGE 'any';
> 
> so - for those function no ...
>     WHILE ...
>        RETURN NEXT
> 
> but just one tuple per call ...
> this would pretty much do it for this case.
> i would not even call this a special case - whenever there is a LOT of 
> data, this could make sense.

In Python (and also JavaScript starting at version 1.7) you do it by
returning a generator from a function, which is done by using yield
instead of return.

>>> def numgen(i):
...     while 1:
...         yield i
...         i += 1
>>> ng = numgen(1)
>>> ng
<generator object at 0xb7ce3bcc>
>>> ng.next()
1
>>> ng.next()
2

In fact any pl/python SRF puts its result set into the return buffer
using generator mechanisms, even when you return the result from the
function as a list or an array.

What would be nice is to wire the Python generator directly to
PostgreSQL's FuncNext call.

At the C function level this should probably be a mirror image of AGGREGATE
functions, where you have an init() function that prepares some opaque
data structure and next() for getting records, with some special value
for the end, and preferably also some finalize() to clean up in case
PostgreSQL stops before next() has indicated EOD.
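
To make the shape of that interface concrete, here is a minimal Python
sketch wrapping a generator. It is not an existing PostgreSQL or
pl/python API: StreamingFunction, END_OF_DATA and the method names are
assumptions made up for illustration.

```python
END_OF_DATA = object()  # special value marking the end of the stream

class StreamingFunction:
    """Hypothetical init/next/finalize wrapper around a Python generator."""

    def __init__(self, generator_func, *args):
        # init(): prepare the opaque state (here, an unstarted generator).
        self._gen = generator_func(*args)
        self.finalized = False

    def next(self):
        # next(): return one record, or END_OF_DATA when exhausted.
        try:
            return next(self._gen)
        except StopIteration:
            self.finalize()
            return END_OF_DATA

    def finalize(self):
        # finalize(): release resources, even if the caller stops early.
        if not self.finalized:
            self._gen.close()
            self.finalized = True

def numgen(i):
    """Endless counter, same idea as the interactive example above."""
    while True:
        yield i
        i += 1

fn = StreamingFunction(numgen, 1)
first_two = [fn.next(), fn.next()]
fn.finalize()  # the consumer stops early, e.g. because of a LIMIT
```

The explicit finalize() matters for sources like numgen that never end
on their own: the caller has to be able to say "I am done" without
draining the stream.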

Maybe some extra info would be nice for the optimizer, like the expected
row count or that the data is returned sorted on some field. This would
be good for the current return mechanisms as well.

-----------------
Hannu

Re: Adding pipelining support to set returning functions

From

Martijn van Oosterhout

Date:

11 April 2008, 08:11:47

On Fri, Apr 11, 2008 at 01:00:04PM +0300, Hannu Krosing wrote:
> > i asked myself basically the same question some time ago.
> > pipelining seems fairly impossible unless we ban joins on those
> > "plugins" completely.
>
> Not really, just they have to be "materialized" before joins, or
> streaming node has to be at the driving side of the join, so you can
> fetch one tuple and then join it to index or hash lookup

I thought there was code in the planner already that said: if node A
requires seeking of subnode B and B doesn't support that, insert a
Materialize node. Naturally there's a cost to that, so it would favour
plans that did not require seeking...
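
The idea can be sketched in Python as a buffering wrapper placed over a
one-pass source when the parent needs to rescan it. This only
illustrates the principle and is not the planner's or executor's actual
code; Materialize, nested_loop and stream are hypothetical names.

```python
class Materialize:
    """Buffer a one-pass source so that it can be scanned repeatedly."""

    def __init__(self, source):
        self._source = iter(source)
        self._buffer = []

    def scan(self):
        # Replay buffered rows first, then keep pulling from the source.
        pos = 0
        while True:
            if pos < len(self._buffer):
                yield self._buffer[pos]
            else:
                try:
                    row = next(self._source)
                except StopIteration:
                    return
                self._buffer.append(row)
                yield row
            pos += 1

def nested_loop(outer, inner):
    """Parent node that rescans its inner side once per outer row."""
    for o in outer:
        for i in inner.scan():
            yield (o, i)

def stream():
    """Hypothetical streaming source that cannot be rewound."""
    yield "a"
    yield "b"

pairs = list(nested_loop([1, 2], Materialize(stream())))
```

The buffer is the cost mentioned above: every row of the stream ends up
stored, which is why a plan that reads the stream only once is cheaper.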

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: Adding pipelining support to set returning functions

From

Hans-Juergen Schoenig

Date:

11 April 2008, 09:58:27

Hannu Krosing wrote:
> A question to all pg hackers
>
> Is anybody working on adding pipelining to set returning functions.
>
> How much effort would it take ?
>
> Where should I start digging ?
>   

i asked myself basically the same question some time ago.
pipelining seems fairly impossible unless we ban joins on those 
"plugins" completely.
i think this should be fine for your case (no need to join PL/proxy 
partitions) - what we want here is to re-unify data and send it through 
centralized BI.


> BACKGROUND:
>
> AFAICS , currently set returning functions materialise their results
> before returning, as seen by this simple test:
>
> hannu=# select * from generate_series(1,10) limit 2;
>  generate_series 
> -----------------
>                1
>                2
> (2 rows)
>
> Time: 1.183 ms
>
>
> hannu=# select * from generate_series(1,10000000) limit 2;
>  generate_series 
> -----------------
>                1
>                2
> (2 rows)
>
> Time: 3795.032 ms
>
> being able to pipeline (generate results as needed) would enable several
> interesting techniques, especially if combined with pl/proxy or any
> other functions which stream external data.
>
> Applications and design patterns like http://telegraph.cs.berkeley.edu/
> or http://labs.google.com/papers/mapreduce.html would suddenly become
> very easy to implement.
>
> -----------------
> Hannu
>
>   

currently things like nodeSeqscan do SeqNext and so on - one record is 
passed on to the next level.
why not have a nodePlugin or so doing the same?
or maybe some additional calling convention for streaming functions...

e.g.:
CREATE STREAMING FUNCTION xy() RETURNS NEXT RECORD AS $$
    return exactly one record to keep doing
    return NULL to mark "end of table"
$$ LANGUAGE 'any';

so - for those function no ...
    WHILE ...
       RETURN NEXT

but just one tuple per call ...
this would pretty much do it for this case.
i would not even call this a special case - whenever there is a LOT of 
data, this could make sense.
   best regards,
      hans

-- 
Cybertec Schönig & Schönig GmbH
PostgreSQL Solutions and Support
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Tel: +43/1/205 10 35 / 340
www.postgresql-support.de