Re: short-cutting if sum()>constant - Mailing list pgsql-sql

From Ivan Sergio Borgonovo
Subject Re: short-cutting if sum()>constant
Date
Msg-id 20091223115628.1709caf5@dawn.webthatworks.it
In response to Re: short-cutting if sum()>constant  (Pavel Stehule <pavel.stehule@gmail.com>)
Responses Re: short-cutting if sum()>constant  (Pavel Stehule <pavel.stehule@gmail.com>)
List pgsql-sql
On Wed, 23 Dec 2009 08:38:52 +0100
Pavel Stehule <pavel.stehule@gmail.com> wrote:

> > As even more unexpected... when all row are >0 and most of them
> > are equal to 1 the generate_series performs appreciably better
> > (roughly 15% faster).
> > And I think your version can be further optimised:
> > select count(*) from (select (generate_series(1,a))::int from
> > data limit 90000000) s;
> > This perform 30% faster.
> >
> > So what's so slow in the plpgsql version?
> 
> don't forget - plpgsql is interpret - it is best as glue for SQL

Yeah, but how many times is it interpreted over a 1M-row loop,
especially when you run it more than once and the interpreter should
be "hot"? I don't know how much "interpreter" work can be saved from
a previous run of the function... but there should still be something
left [1].
plpgsql isn't duck-typed, and that function doesn't leave much room
for "interpretation".
Unless I'm missing something, in a statically typed language like
plpgsql that function could easily be turned into its C equivalent
by the interpreter.
I really thought the cost of running plpgsql would be much lower in
this kind of situation.

Consider that the cost doesn't come from the one-time cost of
interpreting the function: it is proportional to the number of
iterations.

Even hard-coding the LIMIT in the plpgsql version doesn't make it
fast enough. So it is not the planner.
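
For reference, the plpgsql version being benchmarked isn't quoted in
this message; it was presumably a loop of roughly this shape (a
sketch only -- the function name, table name and threshold handling
are my assumptions):

create or replace function sum_over(lim bigint)
returns boolean as $$
declare
  s bigint := 0;
  r record;
begin
  for r in select a from data loop
    s := s + r.a;
    if s > lim then
      return true; -- short-cut: stop scanning once the limit is passed
    end if;
  end loop;
  return false;
end;
$$ language plpgsql;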

In a 10M-row dataset where 8332885 rows are =1 and the rest are =2:

If I'm testing for >900000
I get 940.580 ms vs 1302.162 ms
If I'm testing for >9000000
I get 8816.263 ms vs 12306.483 ms

BTW
select sum(a) from data; takes 1999.492 ms.
select count(*) from data; takes 1612.039 ms

While smart, your way is definitely more convoluted and should
require many more CPU cycles and more memory.

The plperl version:
create or replace function totano(int)
returns int as
$$
  my $b = 0;
  my $row;
  my $sth = spi_query("select a from data");
  while (defined($row = spi_fetchrow($sth))) {
    $b += $row->{a};
    if ($b >= $_[0]) {
      return $b;
    }
  }
  return $b;
$$ LANGUAGE plperl;

Is 10 times slower than the plpgsql version.

> statement. I don't thing so plpgsql is slow - speed is similar to
> using buildin functionality. But I am sure, rewritening your
> function to C could help. If you need maximal speed.

In my case it is not worth it. I was just looking for something that
didn't have to scan all the rows.
With the real distribution of data and no real workload on the box,
your solution is faster, but the difference, while unexpectedly
appreciable, is nearly negligible.

Anyway, I'm not able to justify the difference in speed between
plpgsql and your solution in this case, unless plpgsql is switching
back and forth between binary data and its text representation.
It would be nice if someone who knows the plpgsql internals could
explain where the cost comes from.

> I thing, so there are other trick, I am not sure if it is faster.
> You can create own aggregate. In state function you can calculate
> and check state value. If it is over your limit, then you can raise
> exception. So if your query will be finished with custom exception,
> then sum(c) > n is true.

I may test it later.
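
For the record, the aggregate trick you describe could be sketched
like this (untested; the function and aggregate names are mine):

create or replace function sum_check(state bigint, val int, lim int)
returns bigint as $$
begin
  if state + val > lim then
    -- abort the scan as soon as the running sum passes the limit
    raise exception 'limit reached';
  end if;
  return state + val;
end;
$$ language plpgsql;

create aggregate sum_limited(int, int) (
  sfunc = sum_check,
  stype = bigint,
  initcond = '0'
);

Then, if select sum_limited(a, 9000000) from data; ends with the
custom exception, sum(a) > 9000000 is true.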

thanks

-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it


