Thread: short-cutting if sum()>constant
Hi,

I'd like to know if

select sum(qty) from t where status=37;

is > constant.

qty is always >0.

Is there a way to skip examining further rows and return a result
ASAP?

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
2009/12/22 Ivan Sergio Borgonovo <mail@webthatworks.it>:
> Hi,

Hi :-)

> I'd like to know if
>
> select sum(qty) from t where status=37;
>
> is > constant.
>
> qty is always >0.
>
> Is there a way to skip examining further rows and return a result
> ASAP?

With plain SQL, no.

With a user defined function in PL/PgSQL, yes.

> --
> Ivan Sergio Borgonovo
> http://www.webthatworks.it
--
Filip Rembiałkowski
JID/mailto: filip.rembialkowski@gmail.com
http://filip.rembialkowski.net/
On Tue, 2009-12-22 at 13:35 +0100, Ivan Sergio Borgonovo wrote:
> Hi,
> I'd like to know if
> select sum(qty) from t where status=37;
> is > constant.
> qty is always >0.
> Is there a way to skip examining further rows and return a result
> ASAP?

SELECT SUM(object_version)
FROM date_x
WHERE owner_id = 10100
HAVING SUM(object_version) > 1000
?

--
OpenGroupware developer: awilliam@whitemice.org
<http://whitemiceconsulting.blogspot.com/>
OpenGroupware & Cyrus IMAPd documentation @
<http://docs.opengroupware.org/Members/whitemice/wmogag/file_view>
In response to Ivan Sergio Borgonovo:
> Hi,
>
> I'd like to know if
>
> select sum(qty) from t where status=37;
>
> is > constant.
>
> qty is always >0.
>
> Is there a way to skip examining further rows and return a result
> ASAP?

I think no.

But you can create a new table with 2 columns: status (primary key) and
a column for the sum of qty. And you need a TRIGGER: update this table
for every insert, update and delete on your table t.

Now you can ask _this_ new table if the sum(qty) > your constant; this
should work very fast.

Just an idea...

Andreas
--
Andreas Kretschmer
Kontakt:  Heynitz: 035242/47150,   D1: 0160/7141639 (mehr: -> Header)
GnuPG: 0x31720C99, 1006 CCB4 A326 1D42 6431 2EB0 389D 1DC2 3172 0C99
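[For illustration, the trigger-maintained summary table Andreas describes might be
set up roughly like this. It is only a sketch: the summary table, trigger and
function names are made up, and it assumes qty and status are non-null integers.]

-- Summary table: one row per status, kept in sync by a trigger on t.
CREATE TABLE t_qty_sum (
    status  integer PRIMARY KEY,
    qty_sum bigint NOT NULL DEFAULT 0
);

-- Seed it from the existing data.
INSERT INTO t_qty_sum (status, qty_sum)
SELECT status, sum(qty) FROM t GROUP BY status;

CREATE OR REPLACE FUNCTION t_qty_sum_sync() RETURNS trigger AS $$
BEGIN
    IF TG_OP IN ('UPDATE', 'DELETE') THEN
        UPDATE t_qty_sum SET qty_sum = qty_sum - OLD.qty WHERE status = OLD.status;
    END IF;
    IF TG_OP IN ('INSERT', 'UPDATE') THEN
        UPDATE t_qty_sum SET qty_sum = qty_sum + NEW.qty WHERE status = NEW.status;
        IF NOT FOUND THEN
            INSERT INTO t_qty_sum (status, qty_sum) VALUES (NEW.status, NEW.qty);
        END IF;
    END IF;
    RETURN NULL;   -- AFTER trigger: return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER t_qty_sum_sync
AFTER INSERT OR UPDATE OR DELETE ON t
FOR EACH ROW EXECUTE PROCEDURE t_qty_sum_sync();

-- The check then reads a single row:
-- SELECT qty_sum > 1000 FROM t_qty_sum WHERE status = 37;

[The usual trade-off applies: concurrent writers to the same status serialize on the
summary row, in exchange for a constant-time check at read time.]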
Does PG support CTE? You can try it.

> In response to Ivan Sergio Borgonovo:
> > Hi,
> >
> > I'd like to know if
> >
> > select sum(qty) from t where status=37;
> >
> > is > constant.
> >
> > qty is always >0.
> >
> > Is there a way to skip examining further rows and return a result
> > ASAP?
>
> I think no.
>
> But you can create a new table with 2 columns: status (primary key) and
> a column for the sum of qty. And you need a TRIGGER: update this table
> for every insert, update and delete on your table t.
>
> Now you can ask _this_ new table if the sum(qty) > your constant; this
> should work very fast.
>
> Just an idea...
>
> Andreas
On Tue, 22 Dec 2009 14:48:55 +0100
Filip Rembiałkowski <plk.zuber@gmail.com> wrote:

> With plain SQL, no.
>
> With a user defined function in PL/PgSQL, yes.

Thanks to all.

I'm on 8.3 so no CTE.
I was hoping there was some way to write it in plain SQL.

I'm planning to wrap everything in a plpgsql function using cursors
and then switch to WITH when I move to 8.4.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
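[A minimal sketch of the kind of cursor-based PL/pgSQL wrapper Ivan describes.
The function name, threshold parameter and bigint accumulator are assumptions;
the table t, column qty and status=37 are from the original question.]

-- Returns true as soon as the running total passes the threshold,
-- without fetching the remaining rows.
CREATE OR REPLACE FUNCTION sum_exceeds(threshold bigint) RETURNS boolean AS $$
DECLARE
    cur CURSOR FOR SELECT qty FROM t WHERE status = 37;
    v_qty   integer;
    running bigint := 0;
BEGIN
    OPEN cur;
    LOOP
        FETCH cur INTO v_qty;
        EXIT WHEN NOT FOUND;
        running := running + v_qty;
        IF running > threshold THEN
            CLOSE cur;
            RETURN true;
        END IF;
    END LOOP;
    CLOSE cur;
    RETURN false;
END;
$$ LANGUAGE plpgsql;

-- Usage: SELECT sum_exceeds(1000);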
msi77 <msi77@yandex.ru> wrote:
> Does PG support CTE?

Since 8.4, yes.

> You can try it.

Sorry, but I don't know how a CTE can help in this case, can you
explain that?

Andreas
--
Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect.                          (Linus Torvalds)
"If I was god, I would recompile penguin with --enable-fly."   (unknown)
Kaufbach, Saxony, Germany, Europe.          N 51.05082°, E 13.56889°
> Sorry, but I don't know how a CTE can help in this case, can you explain

I mean a RECURSIVE CTE. You can check your condition on each iteration
and stop execution when the condition is false.

Sergey

> msi77 <msi77@yandex.ru> wrote:
> > Does PG support CTE?
>
> Since 8.4, yes.
>
> > You can try it.
>
> Sorry, but I don't know how a CTE can help in this case, can you
> explain that?
>
> Andreas
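[Purely to illustrate the idea Sergey describes, here is one possible shape of such
a query. It is a sketch, not something that would have run on the 8.3/8.4 versions
discussed in this thread: the LATERAL join needs PostgreSQL 9.3+, it assumes t has a
unique id column (ideally with an index on (status, id)), and 1000 is a placeholder
constant.]

WITH RECURSIVE walk(id, running) AS (
    (SELECT id, qty::bigint
     FROM t
     WHERE status = 37
     ORDER BY id
     LIMIT 1)
  UNION ALL
    SELECT nxt.id, walk.running + nxt.qty
    FROM walk
    CROSS JOIN LATERAL (
        SELECT id, qty
        FROM t
        WHERE status = 37 AND id > walk.id
        ORDER BY id
        LIMIT 1
    ) nxt
    WHERE walk.running < 1000   -- stop recursing once the constant is reached
)
SELECT coalesce(bool_or(running >= 1000), false) AS exceeds
FROM walk;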
Hello

I found one ugly trick. You can multiply the lines, and "SUM > const"
can be replaced by a LIMIT clause:

postgres=# select * from data;
 a
---
 3
 2
 1
 4
 2
 3
(6 rows)

Then SELECT * FROM ... WHERE ... and stop when SUM(a) = n:

postgres=# select generate_series(1,a) from data;
 generate_series
-----------------
 1
 2
 3
 1
 2
 1
 1
 2
 3
 4
 1
 2
 1
 2
 3

So if I want to check whether sum(a) >= 10, then I can use LIMIT 10. If
the query returns ten rows, the result is true, else the result is
false.

select a, (a = generate_series(1,a))::int from data limit 12;  -- stop after sum(a) = 12

postgres=# select sum(x) from (select 1 as x, (a = generate_series(1,a))::int from data limit 12) s;
 sum
-----
  12    -- 12 is eq 12, so the test is successful
(1 row)

Regards
Pavel Stehule

2009/12/22 Ivan Sergio Borgonovo <mail@webthatworks.it>:
> Hi,
>
> I'd like to know if
>
> select sum(qty) from t where status=37;
>
> is > constant.
>
> qty is always >0.
>
> Is there a way to skip examining further rows and return a result
> ASAP?
>
> --
> Ivan Sergio Borgonovo
> http://www.webthatworks.it
On Tue, 22 Dec 2009 20:47:18 +0100
Pavel Stehule <pavel.stehule@gmail.com> wrote:

> Hello
>
> I found one ugly trick. You can multiply the lines, and "SUM > const"
> can be replaced by a LIMIT clause:

The trick is really smart (and fun), kudos, really, it's always a
pleasure to read your solutions, thanks.

But as expected:

In a table with 100000 rows with random values [0,9]

test=# select sum(x) from (select 1 as x, (a = generate_series(1,a))::int from data limit 100000) s;
  sum
--------
 100000
(1 row)

Time: 71.423 ms

test=# create or replace function tano(a int, out b int)
returns int as
$$
declare
  row record;
begin
  b := 0;
  for row in select a as _a from data
  loop
    b := row._a + b;
    if (b >= a) then
      return;
    end if;
  end loop;
  return;
end;
$$ language plpgsql;

test=# select * from tano(100000);
   b
--------
 100000
(1 row)

Time: 0.187 ms

I ran both several times to avoid simple caching issues... anyway I
didn't really run a serious benchmark, but results were always in the
same order of magnitude. I hope I didn't make any mistake.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
On Wed, 23 Dec 2009 00:00:31 +0100
Ivan Sergio Borgonovo <mail@webthatworks.it> wrote:

> On Tue, 22 Dec 2009 20:47:18 +0100
> Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> > Hello
> >
> > I found one ugly trick. You can multiply the lines, and "SUM > const"
> > can be replaced by a LIMIT clause:
>
> The trick is really smart (and fun), kudos, really, it's always a
> pleasure to read your solutions, thanks.
>
> But as expected:

as unexpected...

> test=# create or replace function tano(a int, out b int)
                                         ^^^
                                         should be anything but a
> returns int as
> $$
> declare
>   row record;
> begin
>   b := 0;
>   for row in select a as _a from data where a>0
>   loop
>     b := row._a + b;
>     if (b >= a) then
>       return;
>     end if;
>   end loop;
>   return;
> end;
> $$ language plpgsql;

Making it longer to better appreciate the difference: 1M rows where a
is in [0,2]

select * from tano((1000000)::int);
    b
---------
 1000001
(1 row)

Time: 1235.243 ms

select sum(x) from (select 1 as x, (a = generate_series(1,a))::int from data limit 1000000) s;
   sum
---------
 1000000
(1 row)

Time: 1309.441 ms

Being fair, once you add the where clause to the generate_series
version the difference in performance is negligible, and saying that
the plpgsql version is faster would require some more serious
benchmarking.

Surprised! If the generate_series version can compete with the plpgsql
for loop... why is the plpgsql version so "slow"?

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
On Wed, 23 Dec 2009 01:09:40 +0100
Ivan Sergio Borgonovo <mail@webthatworks.it> wrote:

> On Wed, 23 Dec 2009 00:00:31 +0100
> Ivan Sergio Borgonovo <mail@webthatworks.it> wrote:
>
> > On Tue, 22 Dec 2009 20:47:18 +0100
> > Pavel Stehule <pavel.stehule@gmail.com> wrote:
> >
> > > Hello
> > >
> > > I found one ugly trick. You can multiply the lines, and "SUM >
> > > const" can be replaced by a LIMIT clause:
> >
> > The trick is really smart (and fun), kudos, really, it's always a
> > pleasure to read your solutions, thanks.
> >
> > But as expected:
>
> as unexpected...

As even more unexpected... when all rows are >0 and most of them are
equal to 1, the generate_series version performs appreciably better
(roughly 15% faster).

And I think your version can be further optimised:

select count(*) from (select (generate_series(1,a))::int from data limit 90000000) s;

This performs 30% faster.

So what's so slow in the plpgsql version?

Fortunately, as expected, when "enough" rows are >1 the for loop
solution performs much better.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
2009/12/23 Ivan Sergio Borgonovo <mail@webthatworks.it>:
> On Wed, 23 Dec 2009 01:09:40 +0100
> Ivan Sergio Borgonovo <mail@webthatworks.it> wrote:
>
> As even more unexpected... when all rows are >0 and most of them are
> equal to 1, the generate_series version performs appreciably better
> (roughly 15% faster).
> And I think your version can be further optimised:
> select count(*) from (select (generate_series(1,a))::int from
> data limit 90000000) s;
> This performs 30% faster.
>
> So what's so slow in the plpgsql version?

Don't forget - plpgsql is interpreted - it is best as glue for SQL
statements. I don't think plpgsql is slow - its speed is similar to
using built-in functionality. But I am sure rewriting your function in
C could help, if you need maximal speed.

I think there is another trick, though I am not sure if it is faster.
You can create your own aggregate. In the state function you can
calculate and check the state value. If it is over your limit, then you
can raise an exception. So if your query finishes with the custom
exception, then sum(c) > n is true.

Regards
Pavel Stehule

> Fortunately, as expected, when "enough" rows are >1 the for loop
> solution performs much better.
>
> --
> Ivan Sergio Borgonovo
> http://www.webthatworks.it
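[For illustration, the exception-raising aggregate Pavel describes might look roughly
like this. It is a sketch only: the names, the choice to pass the limit as a second
aggregate argument, and catching the generic raise_exception condition are all
assumptions, not Pavel's actual code.]

-- State function: accumulate, and abort the scan once the cap is passed.
CREATE OR REPLACE FUNCTION sum_capped_state(acc bigint, val integer, cap bigint)
RETURNS bigint AS $$
BEGIN
    acc := acc + val;
    IF acc > cap THEN
        RAISE EXCEPTION 'cap exceeded';   -- stops the surrounding query immediately
    END IF;
    RETURN acc;
END;
$$ LANGUAGE plpgsql;

-- Two-argument aggregate: sum_capped(value, cap).
CREATE AGGREGATE sum_capped(integer, bigint) (
    SFUNC    = sum_capped_state,
    STYPE    = bigint,
    INITCOND = '0'
);

-- Caller: true means the scan stopped early because sum(qty) > cap.
CREATE OR REPLACE FUNCTION qty_sum_exceeds(cap bigint) RETURNS boolean AS $$
BEGIN
    PERFORM sum_capped(qty, cap) FROM t WHERE status = 37;
    RETURN false;
EXCEPTION WHEN raise_exception THEN
    RETURN true;
END;
$$ LANGUAGE plpgsql;

[The exception still has to unwind the scan, so it is not free, but no further rows
are read once the limit is hit.]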
On Wed, 23 Dec 2009 08:38:52 +0100
Pavel Stehule <pavel.stehule@gmail.com> wrote:

> > As even more unexpected... when all rows are >0 and most of them
> > are equal to 1, the generate_series version performs appreciably
> > better (roughly 15% faster).
> > And I think your version can be further optimised:
> > select count(*) from (select (generate_series(1,a))::int from
> > data limit 90000000) s;
> > This performs 30% faster.
> >
> > So what's so slow in the plpgsql version?
>
> Don't forget - plpgsql is interpreted - it is best as glue for SQL

Yeah, but how many times is it interpreted in a 1M cycle, especially
when you run it more than one time and the interpreter should be
"hot"? I don't know how much "interpreter" work can be saved from a
previous run of the function... but still there should be something
left [1].

plpgsql isn't duck typed and that function doesn't leave too much
space for "interpretation".

Unless I'm missing something, in a typed language like plpgsql that
function could be easily turned into its C equivalent by the
interpreter.

I really thought the cost of running plpgsql was much lower in such
kind of situation.

Consider that the cost doesn't come from the initial cost of
interpreting the function but is proportional to the number of cycles.

Even hard coding the LIMIT in the plpgsql version doesn't make it
fast enough. So it is not the planner.

In a 10M rows dataset where 8332885 rows are =1 and the rest are =2:

If I'm testing for >900000
I get 940.580 ms vs 1302.162 ms
If I'm testing for >9000000
I get 8816.263 ms vs 12306.483 ms

BTW
select sum(a) from data; takes 1999.492 ms.
select count(*) from data; takes 1612.039 ms

While smart, your way is definitively more convoluted and should
require much more cpu cycles and memory.

The plperl version:

create or replace function totano(int)
returns int as
$$
  my $b = 0;
  my $row;
  my $sth = spi_query("select a from data");
  while(defined($row = spi_fetchrow($sth))) {
    $b += $row->{a};
    if($b >= $_[0]) {
      return $b;
    }
  }
  return $b;
$$ LANGUAGE plperl;

is 10 times slower than the plpgsql version.

> statements. I don't think plpgsql is slow - its speed is similar to
> using built-in functionality. But I am sure rewriting your function
> in C could help, if you need maximal speed.

In my case it is not worth it. I was just looking into something that
didn't have to scan all the rows.

With the real distribution of data and no real workload on the box
your solution is faster, but the difference, while unexpectedly
appreciable, is nearly negligible.

Anyway, I'm not able to justify the difference in speed between
plpgsql and your solution in such a case unless plpgsql is switching
back and forward between binary data and their text representation.
It would be nice if someone who knows plpgsql internals explained
where the cost comes from.

> I think there is another trick, though I am not sure if it is faster.
> You can create your own aggregate. In the state function you can
> calculate and check the state value. If it is over your limit, then
> you can raise an exception. So if your query finishes with the custom
> exception, then sum(c) > n is true.

I may test it later.

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
2009/12/23 Ivan Sergio Borgonovo <mail@webthatworks.it>:
> On Wed, 23 Dec 2009 08:38:52 +0100
> Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
>> > As even more unexpected... when all rows are >0 and most of them
>> > are equal to 1, the generate_series version performs appreciably
>> > better (roughly 15% faster).
>> > And I think your version can be further optimised:
>> > select count(*) from (select (generate_series(1,a))::int from
>> > data limit 90000000) s;
>> > This performs 30% faster.
>> >
>> > So what's so slow in the plpgsql version?
>>
>> Don't forget - plpgsql is interpreted - it is best as glue for SQL
>
> Yeah, but how many times is it interpreted in a 1M cycle, especially
> when you run it more than one time and the interpreter should be
> "hot"? I don't know how much "interpreter" work can be saved from a
> previous run of the function... but still there should be something
> left [1].
> plpgsql isn't duck typed and that function doesn't leave too much
> space for "interpretation".
> Unless I'm missing something, in a typed language like plpgsql that
> function could be easily turned into its C equivalent by the
> interpreter.
> I really thought the cost of running plpgsql was much lower in such
> kind of situation.

The problem is the massive cycle. Plpgsql really isn't a language for
this. The interpreter is missing its own arithmetic unit. So every
expression is translated to a SELECT statement:

IF a > c   ~  SELECT a > c
a = a + 1  ~  SELECT a + 1

These queries are executed in some special mode, but it is still more
expensive than the C  a = a + 1.

> Consider that the cost doesn't come from the initial cost of
> interpreting the function but is proportional to the number of
> cycles.
>
> Even hard coding the LIMIT in the plpgsql version doesn't make it
> fast enough. So it is not the planner.
>
> In a 10M rows dataset where 8332885 rows are =1 and the rest are =2:
>
> If I'm testing for >900000
> I get 940.580 ms vs 1302.162 ms
> If I'm testing for >9000000
> I get 8816.263 ms vs 12306.483 ms
>
> BTW
> select sum(a) from data; takes 1999.492 ms.
> select count(*) from data; takes 1612.039 ms

It is slower because there is one NULL test more.

> While smart, your way is definitively more convoluted and should
> require much more cpu cycles and memory.
>
> The plperl version:
>
> create or replace function totano(int)
> returns int as
> $$
>   my $b = 0;
>   my $row;
>   my $sth = spi_query("select a from data");
>   while(defined($row = spi_fetchrow($sth))) {
>     $b += $row->{a};
>     if($b >= $_[0]) {
>       return $b;
>     }
>   }
>   return $b;
> $$ LANGUAGE plperl;
>
> is 10 times slower than the plpgsql version.

PLpgSQL quietly uses cursors for FOR SELECT; your plperl uses a normal
standard recordset - it means your query generates a full scan. In your
plperl code you have to use explicit CURSORS. Then the speed should be
comparable. Plperlu is a little bit faster, but plpgsql has direct
cursor support.

>> statements. I don't think plpgsql is slow - its speed is similar to
>> using built-in functionality. But I am sure rewriting your function
>> in C could help, if you need maximal speed.
>
> In my case it is not worth it. I was just looking into something that
> didn't have to scan all the rows.
> With the real distribution of data and no real workload on the box
> your solution is faster, but the difference, while unexpectedly
> appreciable, is nearly negligible.
>
> Anyway, I'm not able to justify the difference in speed between
> plpgsql and your solution in such a case unless plpgsql is switching
> back and forward between binary data and their text representation.
> It would be nice if someone who knows plpgsql internals explained
> where the cost comes from.

I think the most expensive part of your plpgsql code is the two
expressions.

Regards
Pavel

>> I think there is another trick, though I am not sure if it is faster.
>> You can create your own aggregate. In the state function you can
>> calculate and check the state value. If it is over your limit, then
>> you can raise an exception. So if your query finishes with the custom
>> exception, then sum(c) > n is true.
>
> I may test it later.
>
> thanks
>
> --
> Ivan Sergio Borgonovo
> http://www.webthatworks.it
On Wed, 23 Dec 2009 12:52:38 +0100
Pavel Stehule <pavel.stehule@gmail.com> wrote:

> The problem is the massive cycle. Plpgsql really isn't a language for
> this. The interpreter is missing its own arithmetic unit. So every
> expression is translated to a SELECT statement:
>
> IF a > c   ~  SELECT a > c
> a = a + 1  ~  SELECT a + 1
>
> These queries are executed in some special mode, but it is still more
> expensive than the C  a = a + 1.

I didn't get it.
I'd expect that since plpgsql shares SQL data types it could easily
map something like

if (a>b) then
or
a := a + b

and something like a "for in query" loop was going to be highly
optimised as well.
plpgsql should be the language most tightly bound to the internals of
pg SQL.

> > select sum(a) from data; takes 1999.492 ms.
> > select count(*) from data; takes 1612.039 ms
>
> It is slower because there is one NULL test more.

That didn't come as a surprise. It was there for comparison.

> PLpgSQL quietly uses cursors for FOR SELECT; your plperl uses

I didn't notice. That was my first plperl function in my life.
Anyway that means that that test didn't say anything about
interpreter speed and duck typing.

> > Anyway, I'm not able to justify the difference in speed between
> > plpgsql and your solution in such a case unless plpgsql is
> > switching back and forward between binary data and their text
> > representation. It would be nice if someone who knows plpgsql
> > internals explained where the cost comes from.
>
> I think the most expensive part of your plpgsql code is the two
> expressions.

I didn't understand.

What's the reason the interpreter can't translate the if and the
b := row._a + b; into something that very much resembles compiled C?

plpgsql is not doing anything different than:

select count(*) from (select case when a>3 then 1 else 2 end, a+a
from data limit 9000000) as f;

One of the advantages of using plpgsql (and one annoyance too) is
that variables are declared, and plpgsql should know how to operate
on them with native C code.

BTW the best performer considering all constraints and data
distribution seems to be the simplest one:

select sum(s.a) from (select a from data where a>0 limit 9000000) s;
Time: 2620.677 ms

We've no negative int... and they are int, so they have to be no
less than 1 if they are not 0. If I know there are no 0, the
simplest version becomes even faster.

When I randomly pick values in [1,3] plpgsql and generate_series
start to perform similarly, but still the simplest version is leading.

When the interval is in [1,6] the plpgsql becomes faster than the
generate_series, but the simplest version is still leading.

Only when the interval is in [1,10] do the plpgsql version and the
simplest one become comparable.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
2009/12/23 Ivan Sergio Borgonovo <mail@webthatworks.it>:
> On Wed, 23 Dec 2009 12:52:38 +0100
> Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
>> The problem is the massive cycle. Plpgsql really isn't a language for
>> this. The interpreter is missing its own arithmetic unit. So every
>> expression is translated to a SELECT statement:
>>
>> IF a > c   ~  SELECT a > c
>> a = a + 1  ~  SELECT a + 1
>>
>> These queries are executed in some special mode, but it is still more
>> expensive than the C  a = a + 1.
>
> I didn't get it.
> I'd expect that since plpgsql shares SQL data types it could easily
> map something like
>
> if (a>b) then
> or
> a := a + b
>
> and something like a "for in query" loop was going to be highly
> optimised as well.
> plpgsql should be the language most tightly bound to the internals of
> pg SQL.

No, there is not any internal optimisation. PostgreSQL can optimize
only the execution plan - not individual expressions.

>> > select sum(a) from data; takes 1999.492 ms.
>> > select count(*) from data; takes 1612.039 ms
>>
>> It is slower because there is one NULL test more.
>
> That didn't come as a surprise. It was there for comparison.
>
>> PLpgSQL quietly uses cursors for FOR SELECT; your plperl uses
>
> I didn't notice. That was my first plperl function in my life.
> Anyway that means that that test didn't say anything about
> interpreter speed and duck typing.

You don't compare equal things.

>> > Anyway, I'm not able to justify the difference in speed between
>> > plpgsql and your solution in such a case unless plpgsql is
>> > switching back and forward between binary data and their text
>> > representation. It would be nice if someone who knows plpgsql
>> > internals explained where the cost comes from.
>
>> I think the most expensive part of your plpgsql code is the two
>> expressions.

one expression:  if >> loop_var > parameter <<
second expression:  loop_var + 1

> I didn't understand.
>
> What's the reason the interpreter can't translate the if and the
> b := row._a + b; into something that very much resembles compiled C?

a) simplicity. The PLpgSQL interpreter is very simple.
b) mostly the types behave a little bit differently than natural C
types - domains are different, C types don't know the NULL value, ...

So if you would like maximum speed, then you can use the C language. It
is relatively simple, much simpler than C procedures in T-SQL or Oracle.

> plpgsql is not doing anything different than:
> select count(*) from (select case when a>3 then 1 else 2 end, a+a
> from data limit 9000000) as f;

No - it doesn't have any intelligence - it doesn't join expressions
together. It does exactly what you write. See
http://okbob.blogspot.com/2009/12/how-dont-write-plpgsql-functions.html

> One of the advantages of using plpgsql (and one annoyance too) is
> that variables are declared, and plpgsql should know how to operate
> on them with native C code.

It is a little bit nonsense. For 99% plpgsql uses the SPI api and works
with variables via a query interface. PLpgSQL can do a loop statement,
an if statement, but for all the rest it uses an internal query based
API.

> BTW the best performer considering all constraints and data
> distribution seems to be the simplest one:
>
> select sum(s.a) from (select a from data where a>0 limit 9000000) s;
> Time: 2620.677 ms

This query is a little bit different from your original request, but it
could work for you.

Regards
Pavel Stehule

> We've no negative int... and they are int, so they have to be no
> less than 1 if they are not 0. If I know there are no 0, the
> simplest version becomes even faster.
>
> When I randomly pick values in [1,3] plpgsql and generate_series
> start to perform similarly, but still the simplest version is leading.
>
> When the interval is in [1,6] the plpgsql becomes faster than the
> generate_series, but the simplest version is still leading.
>
> Only when the interval is in [1,10] do the plpgsql version and the
> simplest one become comparable.
>
> --
> Ivan Sergio Borgonovo
> http://www.webthatworks.it
On Wed, 23 Dec 2009 14:35:27 +0100
Pavel Stehule <pavel.stehule@gmail.com> wrote:

> a) simplicity. The PLpgSQL interpreter is very simple.
> b) mostly the types behave a little bit differently than natural C
> types - domains are different, C types don't know the NULL value, ...

But well... there should be a mapping somewhere between SQL int and
C, and it can be plainly reused.

$b = 5 + "10";

is going to be far more problematic to execute... but I guess that's
exactly the problem.

create or replace function tano(z int, out b int) returns int as
$$
declare
  x varchar(4) = '12';
  y int = 10;
begin
  b := x + y;
  return;
end;
$$ language plpgsql;

select * from tano(1);
ERROR:  operator does not exist: character varying + integer
LINE 1: SELECT  $1  +  $2
                    ^

From what I can guess... the interpreter is sending SQL commands to
pg and waiting for results back. So actually the "interpreter" isn't
really aware of data types... it just relies on SQL to spot data type
mismatches.

That means... that everything is converted back and forward to
*text*.

On the other side, once a SQL command is interpreted it actually
knows what the data types are and can pretty quickly sum int to int,
just taking care of nulls and overflows that would anyway be problems
for any other language trying to sum "SQL ints".

Even if it had to sum int and bigint the code could be optimised for
just that, and the need for a cast should be known in advance before
every "loop".

Did I get it?

That's what you were trying to make me understand with:

> It is a little bit nonsense. For 99% plpgsql uses the SPI api and
> works with variables via a query interface. PLpgSQL can do a loop
> statement, an if statement, but for all the rest it uses an internal
> query based API.

> This query is a little bit different from your original request, but
> it could work for you.

Yep... making clear a is an int simplifies the problem quite a lot.
But you couldn't use generate_series if a was not an int.

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
Pavel Stehule wrote:
> The problem is the massive cycle. Plpgsql really isn't a language for
> this. The interpreter is missing its own arithmetic unit. So every
> expression is translated to a SELECT statement:
>
> IF a > c   ~  SELECT a > c
> a = a + 1  ~  SELECT a + 1
>
> These queries are executed in some special mode, but it is still more
> expensive than the C  a = a + 1.

... and may have different rules, so you can't just write a simple "map
expressions to C equivalents" arithmetic evaluator.

It's also far from easy to just translate PL/PgSQL to directly
machine-executable code as you suggested, by the way. It'd really
require a just-in-time compiler akin to what Java uses, though the
ability to compile once and cache would help get rid of some of the
complexity of Java's. It'd quickly become attractive to just use
PL/Java instead, or write your own C-language function and LOAD it.

--
Craig Ringer
2009/12/23 Ivan Sergio Borgonovo <mail@webthatworks.it>:
> On Wed, 23 Dec 2009 14:35:27 +0100
> Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
>> a) simplicity. The PLpgSQL interpreter is very simple.
>> b) mostly the types behave a little bit differently than natural C
>> types - domains are different, C types don't know the NULL value, ...
>
> But well... there should be a mapping somewhere between SQL int and
> C, and it can be plainly reused.
>
> $b = 5 + "10";
>
> is going to be far more problematic to execute... but I guess that's
> exactly the problem.
>
> create or replace function tano(z int, out b int) returns int as
> $$
> declare
>   x varchar(4) = '12';
>   y int = 10;
> begin
>   b := x + y;
>   return;
> end;
> $$ language plpgsql;
>
> select * from tano(1);
> ERROR:  operator does not exist: character varying + integer
> LINE 1: SELECT  $1  +  $2
>                     ^
>
> From what I can guess... the interpreter is sending SQL commands to
> pg and waiting for results back. So actually the "interpreter" isn't
> really aware of data types... it just relies on SQL to spot data type
> mismatches.
> That means... that everything is converted back and forward to
> *text*.

No - not everything. plpgsql holds values in PostgreSQL native types.
But these types are not 100% equal to C types. Integer is +- equal to C
int. Varchar is absolutely different from a C string.

> On the other side, once a SQL command is interpreted it actually
> knows what the data types are and can pretty quickly sum int to int,
> just taking care of nulls and overflows that would anyway be problems
> for any other language trying to sum "SQL ints".
>
> Even if it had to sum int and bigint the code could be optimised for
> just that, and the need for a cast should be known in advance before
> every "loop".

It could be, but it isn't. PLpgSQL is very simple - you have to know
that every function is "recompiled" every time it is called for the
first time in a session. So there is no time for full optimisations
like in C languages. PLpgSQL does fast non-optimised execution - like
Pascal.

If you need well optimized code, then you have to use the C language
and external stored procedures. PLpgSQL is best as glue for SQL
statements, not for numeric calculation or complex string operations.

Regards
Pavel Stehule

> Did I get it?
>
> That's what you were trying to make me understand with:
>
>> It is a little bit nonsense. For 99% plpgsql uses the SPI api and
>> works with variables via a query interface. PLpgSQL can do a loop
>> statement, an if statement, but for all the rest it uses an internal
>> query based API.
>
>> This query is a little bit different from your original request, but
>> it could work for you.
>
> Yep... making clear a is an int simplifies the problem quite a lot.
> But you couldn't use generate_series if a was not an int.
>
> thanks
>
> --
> Ivan Sergio Borgonovo
> http://www.webthatworks.it
Craig Ringer <craig@postnewspapers.com.au> writes:
> Pavel Stehule wrote:
>> These queries are executed in some special mode, but it is still more
>> expensive than the C  a = a + 1.

> ... and may have different rules, so you can't just write a simple "map
> expressions to C equivalents" arithmetic evaluator.

Yeah.  As an example, overflow is supposed to be caught in "a + 1",
unlike what would happen in C.

In principle you could map some of the builtin operators into inline
code, but it would be a great deal of work and the results would be
un-portable.

			regards, tom lane
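[Tom's point can be seen directly from psql; an illustrative snippet, not part of the
original exchange:]

-- 2147483647 is the int4 maximum; adding 1 raises an error
-- instead of silently wrapping around as C would
SELECT 2147483647::int4 + 1;
ERROR:  integer out of range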
On Wed, 23 Dec 2009 11:36:31 -0500
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Craig Ringer <craig@postnewspapers.com.au> writes:
> > Pavel Stehule wrote:
> >> These queries are executed in some special mode, but it is still
> >> more expensive than the C  a = a + 1.
>
> > ... and may have different rules, so you can't just write a
> > simple "map expressions to C equivalents" arithmetic evaluator.
>
> Yeah.  As an example, overflow is supposed to be caught in "a + 1",
> unlike what would happen in C.
>
> In principle you could map some of the builtin operators into
> inline code, but it would be a great deal of work and the results
> would be un-portable.

Thank you all for being so patient...
I really miss how procedural languages actually work internally.

Doesn't pg routinely map between SQL and C?

What is the difference between

select a+a from data;

and

a := a + a;

in a plpgsql function?

plpgsql knows that a is e.g. an int, so it could just use the same C
code that it uses when it has to sum a+a in SQL.

My guess, since I don't even know what to look for to get an idea of
the internal workings of plpgsql, is that the interpreter translates
the code into SQL (sort of...), sends it to the parser through
SPI_execute/prepare etc... (so yeah, maybe for the "data" it is not
really sending a "text" representation of the data) but still... the
"code" has to be further interpreted...

So something like:

a := a + a;

turns out to be:

SPI_prepare("SELECT $1 + $2", 2, ...);

and this is going to be called for every loop, while I thought the
SQL engine and the plpgsql interpreter were nearer, so that the
interpreter could push the values of a directly into the SQL engine.

Am I getting nearer?

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
2009/12/23 Ivan Sergio Borgonovo <mail@webthatworks.it>:
> On Wed, 23 Dec 2009 11:36:31 -0500
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> Yeah.  As an example, overflow is supposed to be caught in "a + 1",
>> unlike what would happen in C.
>
>> In principle you could map some of the builtin operators into
>> inline code, but it would be a great deal of work and the results
>> would be un-portable.
>
> Thank you all for being so patient...
> I really miss how procedural languages actually work internally.
>
> Doesn't pg routinely map between SQL and C?

Generally yes, but not directly.

For example: the operator + for the type integer is wrapped by the
function int4pl,
http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/utils/adt/int.c?rev=1.86

After parsing and transformations, the expression is transformed into a
sequence of calls to functions like int4pl. PostgreSQL doesn't contain
a C compiler - so it cannot compile to target code.

> What is the difference between
>
> select a+a from data;
> and
> a := a + a;
> in a plpgsql function?

Nothing. You have to know that plpgsql doesn't see the expression a+a;
the content of any expression is invisible to the plpgsql parser.
PLpgSQL doesn't understand expressions. PLpgSQL knows that somewhere
there has to be an expression, or that somewhere there has to be a
boolean expression, but the expression itself is a black box for the
plpgsql interpreter.

> plpgsql knows that a is e.g. an int, so it could just use the same C
> code that it uses when it has to sum a+a in SQL.

PLpgSQL knows it. But this knowledge isn't enough. You would have to
have a real compiler to machine code. But PostgreSQL doesn't have a
real compiler - it is only a set of specialised interpreters. There is
an SQL interpreter, there is a PLpgSQL interpreter. Nothing is
translated to machine code.

> My guess, since I don't even know what to look for to get an idea of
> the internal workings of plpgsql, is that the interpreter translates
> the code into SQL (sort of...), sends it to the parser through
> SPI_execute/prepare etc... (so yeah, maybe for the "data" it is not
> really sending a "text" representation of the data) but still... the
> "code" has to be further interpreted...

+/- plpgsql uses cached plans. So SPI_prepare is called only when the
expression is evaluated for the first time (per session).

> So something like:
> a := a + a;
> turns out to be:
> SPI_prepare("SELECT $1 + $2", 2, ...);
> and this is going to be called for every loop, while I thought the
> SQL engine and the plpgsql interpreter were nearer, so that the
> interpreter could push the values of a directly into the SQL engine.
>
> Am I getting nearer?

There are two steps:

if (cached_plan == NULL)
    cached_plan = prepare("SELECT $1 + $2", info_about_types[])

result = exec(cached_plan, values[], nulls[], &isnull)
.... something_like_move_result_to_variable(address_of_a)

Pavel
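[As a concrete illustration of the int4pl mapping Pavel mentions - not from the
original exchange - int4pl is an ordinary catalog function, so the operator and the
underlying C function can be called side by side:]

-- the integer + operator is implemented by the C function int4pl;
-- both expressions return 5
SELECT 2 + 3 AS via_operator, int4pl(2, 3) AS via_function;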