Thread: Adding pipelining support to set returning functions
A question to all pg hackers Is anybody working on adding pipelining to set returning functions. How much effort would it take ? Where should I start digging ? BACKGROUND: AFAICS , currently set returning functions materialise their results before returning, as seen by this simple test: hannu=# select * from generate_series(1,10) limit 2;generate_series ----------------- 1 2 (2 rows) Time: 1.183 ms hannu=# select * from generate_series(1,10000000) limit 2;generate_series ----------------- 1 2 (2 rows) Time: 3795.032 ms being able to pipeline (generate results as needed) would enable several interesting techniques, especially if combined with pl/proxy or any other functions which stream external data. Applications and design patterns like http://telegraph.cs.berkeley.edu/ or http://labs.google.com/papers/mapreduce.html would suddenly become very easy to implement. ----------------- Hannu
On Sun, Apr 06, 2008 at 10:01:20AM +0300, Hannu Krosing wrote: > A question to all pg hackers > > Is anybody working on adding pipelining to set returning functions. > > How much effort would it take ? > > Where should I start digging ? > > BACKGROUND: > > AFAICS , currently set returning functions materialise their results > before returning, as seen by this simple test: +1 for doing this. :) Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Hannu Krosing wrote: > A question to all pg hackers > > Is anybody working on adding pipelining to set returning functions. Not as far as I know. > How much effort would it take ? > > Where should I start digging ? I don't remember all the details, but I think the original SRF patch that I did was pipelined rather that materialized, and there were difficult issues that led to going with the materialized version "first", with a pipelined version to "eventually" follow. SFRM_Materialize is for the former and SFRM_ValuePerCall was supposed to be for the latter. The first thread was this one: http://archives.postgresql.org/pgsql-hackers/2002-04/msg01201.php Joe
On Fri, 2008-04-11 at 10:57 +0200, Hans-Juergen Schoenig wrote: > Hannu Krosing wrote: > > A question to all pg hackers > > > > Is anybody working on adding pipelining to set returning functions. > > > > How much effort would it take ? > > > > Where should I start digging ? > > > > i asked myself basically the same question some time ago. > pipelining seems fairly impossible unless we ban joins on those > "plugins" completely. Not really, just they have to be "materialized" before joins, or streaming node has to be at the driving side of the join, so you can fetch one tuple and then join it to index or hash lookup > i think this should be fine for your case (no need to join PL/proxy > partitions) - what we want here is to re-unify data and sent it through > centralized BI. In PL/Proxy context I was aiming at sorting data at nodes and then being able to merge several partitions while preserving order, and doing this without having to store N*partition_size rows in resultset. > > ... > > > > currently things like nodeSeqscan do SeqNext and so on - one records is > passed on to the next level. > why not have a nodePlugin or so doing the same? > or maybe some additional calling convention for streaming functions... > > e.g.: > CREATE STREAMING FUNCTION xy() RETURNS NEXT RECORD AS $$ > return exactly one record to keep doing > return NULL to mark "end of table" > $$ LANGUAGE 'any'; > > so - for those function no ... > WHILE ... > RETURN NEXT > > but just one tuple per call ... > this would pretty much do it for this case. > i would not even call this a special case - whenever there is a LOT of > data, this could make sense. In python (an also javascript starting at version 1.7) you do it by returning a generator from a function, which is done by using YIELD instead of return. >>> def numgen(i): ... while 1: ... yield i ... i += 1 >>> ng = numgen(1) >>> ng <generator object at 0xb7ce3bcc> >>> ng.next() 1 >>> ng.next() 2 In fact any pl/python function SRF puts its result set to retun buffer using generator mechanisms, even in case you return the result from function as a list or an array. What would be nice, is to wire the python generator directly to postgreSQLs FuncNext call At C function level this should probably be a mirror image of AGGREGATE functions, where you have a init() function that prepares some opaque data structure and next() for getting records with some special value for end, and preferrably also some finalize() to clean up in case postgresql stops before next() indicated EOD. Maybe some extra info would be nice for optimized, like expected rowcouunt or that data is returned sorted on some field. This would be good for current return mechanisms as well. ----------------- Hannu
On Fri, Apr 11, 2008 at 01:00:04PM +0300, Hannu Krosing wrote: > > i asked myself basically the same question some time ago. > > pipelining seems fairly impossible unless we ban joins on those > > "plugins" completely. > > Not really, just they have to be "materialized" before joins, or > streaming node has to be at the driving side of the join, so you can > fetch one tuple and then join it to index or hash lookup I thought these was code in the planner already that said: if node A requires seeking of subnode B and B doesn't support that, insert Materialize node. Naturally there's a cost to that, so it would favour plans that did not require seeking... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines.
Hannu Krosing wrote: > A question to all pg hackers > > Is anybody working on adding pipelining to set returning functions. > > How much effort would it take ? > > Where should I start digging ? > i asked myself basically the same question some time ago. pipelining seems fairly impossible unless we ban joins on those "plugins" completely. i think this should be fine for your case (no need to join PL/proxy partitions) - what we want here is to re-unify data and sent it through centralized BI. > BACKGROUND: > > AFAICS , currently set returning functions materialise their results > before returning, as seen by this simple test: > > hannu=# select * from generate_series(1,10) limit 2; > generate_series > ----------------- > 1 > 2 > (2 rows) > > Time: 1.183 ms > > > hannu=# select * from generate_series(1,10000000) limit 2; > generate_series > ----------------- > 1 > 2 > (2 rows) > > Time: 3795.032 ms > > being able to pipeline (generate results as needed) would enable several > interesting techniques, especially if combined with pl/proxy or any > other functions which stream external data. > > Applications and design patterns like http://telegraph.cs.berkeley.edu/ > or http://labs.google.com/papers/mapreduce.html would suddenly become > very easy to implement. > > ----------------- > Hannu > > currently things like nodeSeqscan do SeqNext and so on - one records is passed on to the next level. why not have a nodePlugin or so doing the same? or maybe some additional calling convention for streaming functions... e.g.: CREATE STREAMING FUNCTION xy() RETURNS NEXT RECORD AS $$ return exactly one record to keep doing return NULL to mark"end of table" $$ LANGUAGE 'any'; so - for those function no ... WHILE ... RETURN NEXT but just one tuple per call ... this would pretty much do it for this case. i would not even call this a special case - whenever there is a LOT of data, this could make sense. best regards, hans -- Cybertec Schönig & Schönig GmbH PostgreSQL Solutions and Support Gröhrmühlgasse 26, A-2700 Wiener Neustadt Tel: +43/1/205 10 35 / 340 www.postgresql-support.de