Thread: plperl (7.5)

plperl (7.5)

From

elein

Date:

11 July 2004, 01:21:44

The new plperl returns sets by having 
the function return an array.

This requires that the entire array be
built before anything is returned.

It seems to me that that does not scale
very well.  The technique of RETURN NEXT;
scales much better.

For example, you maybe selecting rows,
doing a little formating and adding some
information and returning the rows as you
process them.  It the table you are selecting
is very large, you still have to hold the
results in memory to return them all at 
once.

Am I misunderstanding something or can
someone explain the reasoning?

In an ideal implementation both techniques
would be possible since returning the array
is kind of cool ;-)

--elein
============================================================
elein@varlena.com        Varlena, LLC        www.varlena.com
         PostgreSQL Consulting, Support & Training   

PostgreSQL General Bits   http://www.varlena.com/GeneralBits/
=============================================================
I have always depended on the [QA] of strangers.

Re: plperl (7.5)

From

Alvaro Herrera

Date:

11 July 2004, 02:06:11

On Sat, Jul 10, 2004 at 09:18:28PM -0700, elein wrote:
> The new plperl returns sets by having 
> the function return an array.
> 
> This requires that the entire array be
> built before anything is returned.
> 
> It seems to me that that does not scale
> very well.  The technique of RETURN NEXT;
> scales much better.

I think RETURN NEXT does the same thing anyway ... they just store
tuples in a Tuplestore and then the whole thing is returned.  So the
function actually doesn't return until the whole function is done.

The set-returning-function infraestructure actually has a mode on which
you can return one tuple per call, but PL/pgSQL uses only the other
mode, materializing the whole set before return.

-- 
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"El conflicto es el camino real hacia la unión"

Re: plperl (7.5)

From

Tom Lane

Date:

11 July 2004, 12:19:40

Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> On Sat, Jul 10, 2004 at 09:18:28PM -0700, elein wrote:
>> The new plperl returns sets by having 
>> the function return an array.

> I think RETURN NEXT does the same thing anyway ... they just store
> tuples in a Tuplestore and then the whole thing is returned.  So the
> function actually doesn't return until the whole function is done.

However, it's likely that the tuplestore infrastructure can deal
comfortably with sets far larger than a Perl array can.  (For one thing,
it will swap tuples out to a temp file on disk once the set size exceeds
work_mem.)  I think elein's concern is justified, unless someone can
produce a test case showing that plperl actually performs OK with a
large result set.

As a simple test for plpgsql's speed with such things, I tried

create function seq(int) returns setof int as '
begin for i in 1..$1 loop   return next i; end loop;
return;
end' language plpgsql;

regression=# \timing
Timing is on.
regression=# select count(*) from seq(100000);count  
--------100000
(1 row)

Time: 396.524 ms
regression=# select count(*) from seq(1000000); count  
---------1000000
(1 row)

Time: 3615.115 ms
regression=# select count(*) from seq(10000000); count   
----------10000000
(1 row)

Time: 40356.972 ms

My Perl is too rusty to immediately whip out the equivalent incantation
in plperl; would someone like to compare the timings on their own machine?
    regards, tom lane

Re: plperl (7.5)

From

elein

Date:

11 July 2004, 20:19:07

I'll try these in plperl when I get a chance, but there is still
a scalability problem.  

There was some work done at UCB a while back (post-postgres)
to try to speed up queries by making the first n rows available
quickly.  This is based on the googlish idea that people want
results fast and sometimes only want to see the first few results.

(I saw the return n rows fast before I heard of the google algorithm,
by some years, though.) (It may have been part of hellerstein, et al's
estimated aggregrate capabilities. It was around that time anyway.)

If any function is going to return a gazillion rows
there ought to be a way of doing that block by block.

When a query uses a cursor, is the entire result set always
materialized?  If so, perhaps this is just the way it is
for postgres.  If not, in the future perhaps there can be a
way to do this.  I do not know that part of the source, however.
And I suspect any sorted query would need to be materialized
anyway.

The other piece of the scalability/speed problem is copying
the data from the materialized result set into the return
structure. Theoretically this requires twice the memory of
the result set.  Is that actually true in reality?

Obviously these comments are not for 7.5 nor are they
make or break issues.

--elein 

On Sun, Jul 11, 2004 at 11:17:19AM -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> > On Sat, Jul 10, 2004 at 09:18:28PM -0700, elein wrote:
> >> The new plperl returns sets by having 
> >> the function return an array.
> 
> > I think RETURN NEXT does the same thing anyway ... they just store
> > tuples in a Tuplestore and then the whole thing is returned.  So the
> > function actually doesn't return until the whole function is done.
> 
> However, it's likely that the tuplestore infrastructure can deal
> comfortably with sets far larger than a Perl array can.  (For one thing,
> it will swap tuples out to a temp file on disk once the set size exceeds
> work_mem.)  I think elein's concern is justified, unless someone can
> produce a test case showing that plperl actually performs OK with a
> large result set.
> 
> As a simple test for plpgsql's speed with such things, I tried
> 
> create function seq(int) returns setof int as '
> begin
>   for i in 1..$1 loop
>     return next i;
>   end loop;
> return;
> end' language plpgsql;
> 
> regression=# \timing
> Timing is on.
> regression=# select count(*) from seq(100000);
>  count  
> --------
>  100000
> (1 row)
> 
> Time: 396.524 ms
> regression=# select count(*) from seq(1000000);
>   count  
> ---------
>  1000000
> (1 row)
> 
> Time: 3615.115 ms
> regression=# select count(*) from seq(10000000);
>   count   
> ----------
>  10000000
> (1 row)
> 
> Time: 40356.972 ms
> 
> My Perl is too rusty to immediately whip out the equivalent incantation
> in plperl; would someone like to compare the timings on their own machine?
> 
>         regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings

Re: plperl (7.5)

From

Oleg Bartunov

Date:

12 July 2004, 03:47:36

Several years ago me and Teodor have proposed partial sort for
top-k ranking result. We have even submitted a very crude patch, but it was
rejected. We use partial sorting extensively in our external application
and found it's very useful. There are many papers recently published about
top-k problem in database theory, but I don't know if any database has
such optimization.
Oleg
On Sun, 11 Jul 2004, elein wrote:

> I'll try these in plperl when I get a chance, but there is still
> a scalability problem.
>
> There was some work done at UCB a while back (post-postgres)
> to try to speed up queries by making the first n rows available
> quickly.  This is based on the googlish idea that people want
> results fast and sometimes only want to see the first few results.
>
> (I saw the return n rows fast before I heard of the google algorithm,
> by some years, though.) (It may have been part of hellerstein, et al's
> estimated aggregrate capabilities. It was around that time anyway.)
>
> If any function is going to return a gazillion rows
> there ought to be a way of doing that block by block.
>
> When a query uses a cursor, is the entire result set always
> materialized?  If so, perhaps this is just the way it is
> for postgres.  If not, in the future perhaps there can be a
> way to do this.  I do not know that part of the source, however.
> And I suspect any sorted query would need to be materialized
> anyway.
>
> The other piece of the scalability/speed problem is copying
> the data from the materialized result set into the return
> structure. Theoretically this requires twice the memory of
> the result set.  Is that actually true in reality?
>
> Obviously these comments are not for 7.5 nor are they
> make or break issues.
>
> --elein
>
>
> On Sun, Jul 11, 2004 at 11:17:19AM -0400, Tom Lane wrote:
> > Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> > > On Sat, Jul 10, 2004 at 09:18:28PM -0700, elein wrote:
> > >> The new plperl returns sets by having
> > >> the function return an array.
> >
> > > I think RETURN NEXT does the same thing anyway ... they just store
> > > tuples in a Tuplestore and then the whole thing is returned.  So the
> > > function actually doesn't return until the whole function is done.
> >
> > However, it's likely that the tuplestore infrastructure can deal
> > comfortably with sets far larger than a Perl array can.  (For one thing,
> > it will swap tuples out to a temp file on disk once the set size exceeds
> > work_mem.)  I think elein's concern is justified, unless someone can
> > produce a test case showing that plperl actually performs OK with a
> > large result set.
> >
> > As a simple test for plpgsql's speed with such things, I tried
> >
> > create function seq(int) returns setof int as '
> > begin
> >   for i in 1..$1 loop
> >     return next i;
> >   end loop;
> > return;
> > end' language plpgsql;
> >
> > regression=# \timing
> > Timing is on.
> > regression=# select count(*) from seq(100000);
> >  count
> > --------
> >  100000
> > (1 row)
> >
> > Time: 396.524 ms
> > regression=# select count(*) from seq(1000000);
> >   count
> > ---------
> >  1000000
> > (1 row)
> >
> > Time: 3615.115 ms
> > regression=# select count(*) from seq(10000000);
> >   count
> > ----------
> >  10000000
> > (1 row)
> >
> > Time: 40356.972 ms
> >
> > My Perl is too rusty to immediately whip out the equivalent incantation
> > in plperl; would someone like to compare the timings on their own machine?
> >
> >         regards, tom lane
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 7: don't forget to increase your free space map settings
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
>
Regards,    Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: plperl (7.5)

From

Mike Rylander

Date:

13 July 2004, 16:29:00

<posted & mailed>

Tom Lane wrote:

> Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
>> On Sat, Jul 10, 2004 at 09:18:28PM -0700, elein wrote:
>>> The new plperl returns sets by having
>>> the function return an array.
> 
>> I think RETURN NEXT does the same thing anyway ... they just store
>> tuples in a Tuplestore and then the whole thing is returned.  So the
>> function actually doesn't return until the whole function is done.
> 
> However, it's likely that the tuplestore infrastructure can deal
> comfortably with sets far larger than a Perl array can.  (For one thing,
> it will swap tuples out to a temp file on disk once the set size exceeds
> work_mem.)  I think elein's concern is justified, unless someone can
> produce a test case showing that plperl actually performs OK with a
> large result set.
> 
> As a simple test for plpgsql's speed with such things, I tried
> 
> create function seq(int) returns setof int as '
> begin
>   for i in 1..$1 loop
>     return next i;
>   end loop;
> return;
> end' language plpgsql;
> 
> regression=# \timing
> Timing is on.
> regression=# select count(*) from seq(100000);
>  count
> --------
>  100000
> (1 row)
> 
> Time: 396.524 ms
> regression=# select count(*) from seq(1000000);
>   count
> ---------
>  1000000
> (1 row)
> 
> Time: 3615.115 ms
> regression=# select count(*) from seq(10000000);
>   count
> ----------
>  10000000
> (1 row)
> 
> Time: 40356.972 ms
> 
> My Perl is too rusty to immediately whip out the equivalent incantation
> in plperl; would someone like to compare the timings on their own machine?
> 

I don't have access to a machine with plperl installed, but it would be very
close to this:

create function seq(int) returns setof int as $$
my $count = shift;
my $ret = [];
for my $i ( 1 .. $count ) {       push @$ret, $i;
}
return $ret;
$$ language 'plperl';

... hmmm... the "push" line may need to be:
       push @$ret, { val => $i };

Hope it helps!

> regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings

-- 
--miker

Re: plperl (7.5)

From

Andrew Dunstan

Date:

17 July 2004, 20:52:48

elein wrote:

>The new plperl returns sets by having 
>the function return an array.
>
>This requires that the entire array be
>built before anything is returned.
>
>It seems to me that that does not scale
>very well.  The technique of RETURN NEXT;
>scales much better.
>  
>

Indeed.

>For example, you maybe selecting rows,
>doing a little formating and adding some
>information and returning the rows as you
>process them.  It the table you are selecting
>is very large, you still have to hold the
>results in memory to return them all at 
>once.
>
>Am I misunderstanding something or can
>someone explain the reasoning?
>  
>

The reasoning behind the current set of new features is simple - what 
has been provided is what we were able to get done before feature 
freeze. It is not by any means all that is intended by the plperlng 
project. The rest will have to wait for another release.

>In an ideal implementation both techniques
>would be possible since returning the array
>is kind of cool ;-)
>
>
>  
>

Quite so. I think we would have to provide a callback procedure to add a 
resultset member, and we could easily have it set a flag so that if it 
had been used in the function call we would use that method of 
accumulating resultset members, otherwise we would use the current 
all-at-once method.

But before we embark on any such exercise, we need to have a good 
discussion of future features and APIs. My intention was to start that 
discussion after we put the 7.5 stuff to bed. Please feel free to join 
the plperlng-devel mailing list.

cheers

andrew