Thread: Hmmm... why does CPU-intensive pl/pgsql code parallelise so badly when queries parallelise fine? Anyone else seen this?

Hi everyone,

I've written a new open source tool for easily parallelising SQL scripts in postgres. [obligatory plug: https://github.com/gbb/par_psql ]

Using it, I'm seeing a problem I've seen in other postgres projects involving parallelisation in the last 12 months.

Basically:

- I have machines here with up to 16 CPUs and 128GB memory, very fast SSDs and controller etc, carefully configured kernel/postgresql.conf for high performance.

- Ordinary queries parallelise nearly perfectly (e.g. SELECT some_stuff ...), giving almost up to a 16x performance improvement.

- Calls to CPU-intensive user-defined pl/pgsql functions (e.g. SELECT myfunction(some_stuff)) do not parallelise well, even when they are independent or accessing tables in a read-only way. They hit a limit at 2.5x performance improvement relative to single-CPU performance (pg9.4) and 2x performance (pg9.3). This is about 6 times slower than I'm expecting.

- Can't see what would be locking. It seems like it's the pl/pgsql environment itself that is somehow locking or incurring some huge frictional costs. Whether I use independently defined functions, independent source tables, or independent output tables makes no difference whatsoever, so it doesn't feel 'locky'. It also doesn't seem to be WAL/synchronisation related, as the machines I'm using can hit absurdly high pgbench rates, and I'm using unlogged tables.
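To make the shape of the problem concrete, here is a minimal sketch of the kind of CPU-bound pl/pgsql function involved (illustrative names only, not the actual benchmark code from the repo):

```sql
-- Hypothetical CPU-bound pl/pgsql function: pure arithmetic, no table access.
-- Note that with no volatility label it defaults to VOLATILE.
CREATE OR REPLACE FUNCTION burn_cpu(n integer) RETURNS integer AS $$
DECLARE
  i   integer := 0;
  acc integer := 0;
BEGIN
  WHILE i < n LOOP
    acc := (acc + i) % 1000003;
    i := i + 1;
  END LOOP;
  RETURN acc;
END;
$$ LANGUAGE plpgsql;

-- Run concurrently from N psql sessions and watch CPU utilisation:
-- SELECT burn_cpu(10000000);
```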

Curious? Take a quick peek here: https://github.com/gbb/par_psql/blob/master/BENCHMARKS.md

Wondering what I'm missing here. Any ideas?

Graeme.

On Thu, Jul 2, 2015 at 9:15 AM, Graeme B. Bell <graeme.bell@nibio.no> wrote:
Hi everyone,

I've written a new open source tool for easily parallelising SQL scripts in postgres.   [obligatory plug:   https://github.com/gbb/par_psql   ]

Using it, I'm seeing a problem I've seen in other postgres projects involving parallelisation in the last 12 months.

Basically:

- I have machines here with up to 16 CPUs and 128GB memory, very fast SSDs and controller etc, carefully configured kernel/postgresql.conf for high performance.

- Ordinary queries parallelise nearly perfectly (e.g. SELECT some_stuff ...), e.g. almost up to 16x performance improvement.

- Calls to CPU-intensive user-defined pl/pgsql functions (e.g. SELECT myfunction(some_stuff)) do not parallelise well, even when they are independent or accessing tables in a read-only way. They hit a limit at 2.5x performance improvement relative to single-CPU performance (pg9.4) and 2x performance (pg9.3). This is about 6 times slower than I'm expecting.

- Can't see what would be locking. It seems like it's the pl/pgsql environment itself that is somehow locking or incurring some huge frictional costs. Whether I use independently defined functions, independent source tables, independent output tables, makes no difference whatsoever, so it doesn't feel 'locky'. It also doesn't seem to be WAL/synchronisation related, as the machines I'm using can hit absurdly high pgbench rates, and I'm using unlogged tables.

Curious? Take a quick peek here: https://github.com/gbb/par_psql/blob/master/BENCHMARKS.md

Wondering what I'm missing here. Any ideas?

No ideas, but I ran into the same thing. I have a set of C/C++ functions that put some chemistry calculations into Postgres as extensions (things like, "calculate the molecular weight of this molecule"). As SQL functions, the whole thing bogged down, and we never got the scalability we needed. On our 8-CPU setup, we couldn't get more than 2 CPUs busy at the same time, even with dozens of clients.

When I moved these same functions into an Apache fast-CGI HTTP service (exact same code, same network overhead), I could easily scale up and use the full 100% of all eight CPUs.

I have no idea why, and never investigated further. The convenience of having the functions in SQL wasn't that important.

Craig

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance



--
---------------------------------
Craig A. James
Chief Technology Officer
eMolecules, Inc.
---------------------------------
On 07/07/2015 08:05 PM, Craig James wrote:
>
>
> No ideas, but I ran into the same thing. I have a set of C/C++ functions
> that put some chemistry calculations into Postgres as extensions (things
> like, "calculate the molecular weight of this molecule"). As SQL
> functions, the whole thing bogged down, and we never got the scalability
> we needed. On our 8-CPU setup, we couldn't get more than 2 CPUs busy at
> the same time, even with dozens of clients.
>
> When I moved these same functions into an Apache fast-CGI HTTP service
> (exact same code, same network overhead), I could easily scale up and
> use the full 100% of all eight CPUs.
>
> I have no idea why, and never investigated further. The convenience of
> having the functions in SQL wasn't that important.

I admit that I haven't read this whole thread but:

Using Apache Fast-CGI, you are going to fork a process for each instance
of the function being executed and that in turn will use all CPUs up to
the max available resource.

With PostgreSQL, that isn't going to happen unless you are running (at
least) 8 functions across 8 connections.

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.


> On 07/07/2015 08:05 PM, Craig James wrote:
>>
>>
>> No ideas, but I ran into the same thing. I have a set of C/C++ functions
>> that put some chemistry calculations into Postgres as extensions (things
>> like, "calculate the molecular weight of this molecule"). As SQL
>> functions, the whole thing bogged down, and we never got the scalability
>> we needed. On our 8-CPU setup, we couldn't get more than 2 CPUs busy at
>> the same time, even with dozens of clients.


Hi all,

The sample code / results were put up last night at http://github.com/gbb/ppppt

Craig's problem sounds similar to my own, assuming he means running C indirectly via SQL vs running C more directly.
Lots of parallel connections to postgres but maximum 2 CPUs of scaling (and it gets worse, as you try to run more
things).

Tom Lane has posted an interesting comment over on the bugs list which identifies a likely source of at least one of the problems, maybe both.
It seems to be linked to internal locking inside postgres (which makes sense, given the results - both problems feel 'lock-y').
Also, he mentions a workaround for some functions that apparently scales to 8-way.

http://www.postgresql.org/message-id/31265.1436317984@sss.pgh.pa.us

I think it's potentially a big problem for CPU-intensive postgres libraries like pgrouting, or perhaps the postgis & postgis raster functions, things like that.
I don't know how well their functions are marked for e.g. immutability.
Are there any postgis devs on this list?
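For anyone wanting to check, the markings are visible in the system catalogs; a quick query (assuming the functions live in the public schema; 'i' = immutable, 's' = stable, 'v' = volatile):

```sql
-- List volatility markings for, e.g., PostGIS-style st_* functions.
SELECT p.proname, p.provolatile
FROM pg_proc p
JOIN pg_namespace n ON n.oid = p.pronamespace
WHERE n.nspname = 'public'
  AND p.proname LIKE 'st\_%'
ORDER BY p.proname;
```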

Graeme Bell





On Tue, Jul 7, 2015 at 10:31 PM, Joshua D. Drake <jd@commandprompt.com> wrote:

On 07/07/2015 08:05 PM, Craig James wrote:


No ideas, but I ran into the same thing. I have a set of C/C++ functions
that put some chemistry calculations into Postgres as extensions (things
like, "calculate the molecular weight of this molecule"). As SQL
functions, the whole thing bogged down, and we never got the scalability
we needed. On our 8-CPU setup, we couldn't get more than 2 CPUs busy at
the same time, even with dozens of clients.

When I moved these same functions into an Apache fast-CGI HTTP service
(exact same code, same network overhead), I could easily scale up and
use the full 100% of all eight CPUs.

I have no idea why, and never investigated further. The convenience of
having the functions in SQL wasn't that important.

I admit that I haven't read this whole thread but:

Using Apache Fast-CGI, you are going to fork a process for each instance of the function being executed and that in turn will use all CPUs up to the max available resource.

With PostgreSQL, that isn't going to happen unless you are running (at least) 8 functions across 8 connections.

Well, right, which is why I mentioned "even with dozens of clients." Shouldn't that scale to at least all of the CPUs in use if the function is CPU intensive (which it is)?

Craig




JD

On 07/08/2015 10:48 AM, Craig James wrote:

>     I admit that I haven't read this whole thread but:
>
>     Using Apache Fast-CGI, you are going to fork a process for each
>     instance of the function being executed and that in turn will use
>     all CPUs up to the max available resource.
>
>     With PostgreSQL, that isn't going to happen unless you are running
>     (at least) 8 functions across 8 connections.
>
>
> Well, right, which is why I mentioned "even with dozens of clients."
> Shouldn't that scale to at least all of the CPUs in use if the function
> is CPU intensive (which it is)?

In theory, but it isn't PostgreSQL that does that; it will be the kernel scheduler. Although (and I am grasping at straws):

I wonder if the execution is taking place outside of the backend proper
or... are you using a pooler?

JD




On Wed, Jul 8, 2015 at 10:52 AM, Joshua D. Drake <jd@commandprompt.com> wrote:

On 07/08/2015 10:48 AM, Craig James wrote:

    I admit that I haven't read this whole thread but:

    Using Apache Fast-CGI, you are going to fork a process for each
    instance of the function being executed and that in turn will use
    all CPUs up to the max available resource.

    With PostgreSQL, that isn't going to happen unless you are running
    (at least) 8 functions across 8 connections.


Well, right, which is why I mentioned "even with dozens of clients."
Shouldn't that scale to at least all of the CPUs in use if the function
is CPU intensive (which it is)?

In theory but that isn't PostgreSQL that does that, it will be the kernel scheduler. Although (and I am grasping at straws):

I wonder if the execution is taking place outside of the backend proper or... are you using a pooler?

No pooler, and the functions were in an ordinary SQL extension .so library and loaded as

   CREATE OR REPLACE FUNCTION funcname( ... ) returns ...
   AS 'libxxx.so', 'funcname' LANGUAGE c STRICT IMMUTABLE COST 10000;

Craig


On Wed, Jul 8, 2015 at 12:48 PM, Craig James <cjames@emolecules.com> wrote:
> On Tue, Jul 7, 2015 at 10:31 PM, Joshua D. Drake <jd@commandprompt.com>
> wrote:
>>
>>
>> On 07/07/2015 08:05 PM, Craig James wrote:
>>>
>>>
>>>
>>> No ideas, but I ran into the same thing. I have a set of C/C++ functions
>>> that put some chemistry calculations into Postgres as extensions (things
>>> like, "calculate the molecular weight of this molecule"). As SQL
>>> functions, the whole thing bogged down, and we never got the scalability
>>> we needed. On our 8-CPU setup, we couldn't get more than 2 CPUs busy at
>>> the same time, even with dozens of clients.
>>>
>>> When I moved these same functions into an Apache fast-CGI HTTP service
>>> (exact same code, same network overhead), I could easily scale up and
>>> use the full 100% of all eight CPUs.
>>>
>>> I have no idea why, and never investigated further. The convenience of
>>> having the functions in SQL wasn't that important.
>>
>>
>> I admit that I haven't read this whole thread but:
>>
>> Using Apache Fast-CGI, you are going to fork a process for each instance
>> of the function being executed and that in turn will use all CPUs up to the
>> max available resource.
>>
>> With PostgreSQL, that isn't going to happen unless you are running (at
>> least) 8 functions across 8 connections.
>
>
> Well, right, which is why I mentioned "even with dozens of clients."
> Shouldn't that scale to at least all of the CPUs in use if the function is
> CPU intensive (which it is)?

only in the absence of inter-process locking and cache line bouncing.

merlin


On 2015-07-08 13:46:53 -0500, Merlin Moncure wrote:
> On Wed, Jul 8, 2015 at 12:48 PM, Craig James <cjames@emolecules.com> wrote:
> > On Tue, Jul 7, 2015 at 10:31 PM, Joshua D. Drake <jd@commandprompt.com>
> >> Using Apache Fast-CGI, you are going to fork a process for each instance
> >> of the function being executed and that in turn will use all CPUs up to the
> >> max available resource.
> >>
> >> With PostgreSQL, that isn't going to happen unless you are running (at
> >> least) 8 functions across 8 connections.
> >
> >
> > Well, right, which is why I mentioned "even with dozens of clients."
> > Shouldn't that scale to at least all of the CPUs in use if the function is
> > CPU intensive (which it is)?
>
> only in the absence of inter-process locking and cache line bouncing.

And additionally memory bandwidth (shared between everything, even in
the NUMA case), cross socket/bus bandwidth (absolutely performance
critical in multi-socket configurations), and cache capacity (shared between
cores, and sometimes even sockets!).


On Wed, Jul 8, 2015 at 1:27 PM, Andres Freund <andres@anarazel.de> wrote:
On 2015-07-08 13:46:53 -0500, Merlin Moncure wrote:
> On Wed, Jul 8, 2015 at 12:48 PM, Craig James <cjames@emolecules.com> wrote:
> > On Tue, Jul 7, 2015 at 10:31 PM, Joshua D. Drake <jd@commandprompt.com>
> >> Using Apache Fast-CGI, you are going to fork a process for each instance
> >> of the function being executed and that in turn will use all CPUs up to the
> >> max available resource.
> >>
> >> With PostgreSQL, that isn't going to happen unless you are running (at
> >> least) 8 functions across 8 connections.
> >
> >
> > Well, right, which is why I mentioned "even with dozens of clients."
> > Shouldn't that scale to at least all of the CPUs in use if the function is
> > CPU intensive (which it is)?
>
> only in the absence of inter-process locking and cache line bouncing.

And addititionally memory bandwidth (shared between everything, even in
the numa case), cross socket/bus bandwidth (absolutely performance
critical in multi-socket configurations), cache capacity (shared between
cores, and sometimes even sockets!).

From my admittedly naive point of view, it's hard to see why any of this matters. I have functions that do purely CPU-intensive mathematical calculations ... you could imagine something like is_prime(N) that determines if N is a prime number. I have eight clients that connect to eight backends. Each client issues an SQL command like, "select is_prime(N)" where N is a simple number.

Are you saying that in order to calculate is_prime(N), all of that stuff (inter-process locking, memory bandwidth, bus bandwidth, cache capacity, etc.) is even relevant? And if so, how is it that Postgres is so different from an Apache fast-CGI program that runs the exact same is_prime(N) calculation?

Just curious ... as I said, I've already implemented a different solution.

Craig
On 2015-07-08 15:38:24 -0700, Craig James wrote:
> From my admittedly naive point of view, it's hard to see why any of this
> matters. I have functions that do purely CPU-intensive mathematical
> calculations ... you could imagine something like is_prime(N) that
> determines if N is a prime number. I have eight clients that connect to
> eight backends. Each client issues an SQL command like, "select
> is_prime(N)" where N is a simple number.

I mostly replied to Merlin's general point (additionally in the context of
plpgsql).

But I have a hard time seeing that postgres would be the bottleneck for an
is_prime() function (or something with similar characteristics) that's
written in C where the average runtime is more than, say, a couple
thousand cycles.  I'd like to see a profile of that.

Andres


andres@anarazel.de (Andres Freund) writes:
> On 2015-07-08 15:38:24 -0700, Craig James wrote:
>> From my admittedly naive point of view, it's hard to see why any of this
>> matters. I have functions that do purely CPU-intensive mathematical
>> calculations ... you could imagine something like is_prime(N) that
>> determines if N is a prime number. I have eight clients that connect to
>> eight backends. Each client issues an SQL command like, "select
>> is_prime(N)" where N is a simple number.

> I mostly replied to Merlin's general point (additionally in the context of
> plpgsql).

> But I have a hard time seing that postgres would be the bottleneck for a
> is_prime() function (or something with similar characteristics) that's
> written in C where the average runtime is more than, say, a couple
> thousand cyles.  I'd like to see a profile of that.

But that was not the case that Graeme was complaining about.  He's talking
about simple-arithmetic-and-looping written in plpgsql, in a volatile
function that is going to take a new snapshot for every statement, even if
that's only "n := n+1".  So it's going to spend a substantial fraction of
its runtime banging on the ProcArray, and that doesn't scale.  If you
write your is_prime function purely in plpgsql, and don't bother to mark
it nonvolatile, *it will not scale*.  It'll be slow even in single-thread
terms, but it'll be particularly bad if you're saturating a multicore
machine with it.
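To make that concrete, here is a sketch of such an is_prime in pl/pgsql; the important part is the single volatility keyword at the end, which is valid here because the result depends only on the argument:

```sql
-- Naive trial-division primality test, written purely in pl/pgsql.
-- Left unlabelled it would default to VOLATILE and take a snapshot
-- per statement; IMMUTABLE avoids that.
CREATE OR REPLACE FUNCTION is_prime(n bigint) RETURNS boolean AS $$
DECLARE
  d bigint := 3;
BEGIN
  IF n < 2 THEN RETURN false; END IF;
  IF n % 2 = 0 THEN RETURN n = 2; END IF;
  WHILE d * d <= n LOOP
    IF n % d = 0 THEN RETURN false; END IF;
    d := d + 2;
  END LOOP;
  RETURN true;
END;
$$ LANGUAGE plpgsql IMMUTABLE;
```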

One of my Salesforce colleagues has been looking into ways that we could
decide to skip the per-statement snapshot acquisition even in volatile
functions, if we could be sure that a particular statement isn't going to
do anything that would need a snapshot.  Now, IMO that doesn't really do
much for properly written plpgsql; but there's an awful lot of bad plpgsql
code out there, and it can make a huge difference for that.

            regards, tom lane


On 08 Jul 2015, at 22:27, Andres Freund <andres@anarazel.de> wrote:

> On 2015-07-08 13:46:53 -0500, Merlin Moncure wrote:
>> On Wed, Jul 8, 2015 at 12:48 PM, Craig James <cjames@emolecules.com> wrote:
>>>
>>> Well, right, which is why I mentioned "even with dozens of clients."
>>> Shouldn't that scale to at least all of the CPUs in use if the function is
>>> CPU intensive (which it is)?
>>
>> only in the absence of inter-process locking and cache line bouncing.
>
> And addititionally memory bandwidth (shared between everything, even in
> the numa case), cross socket/bus bandwidth (absolutely performance
> critical in multi-socket configurations), cache capacity (shared between
> cores, and sometimes even sockets!).

1. Note for future readers - it's also worth noting that depending on the operation, and on your hardware, you may have fewer CPU cores than you think to parallelise upon.

1a. For example, AMD CPUs list the number of integer cores (e.g. 16), but there are actually only half as many cores available for floating point work (8). So if your functions need to use floating point, your scaling will suffer badly on FP functions.

https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture)
  "In terms of hardware complexity and functionality, this "module" is equal to a dual-core processor in its integer power, and to a single-core processor in its floating-point power: for each two integer cores, there is one floating-point core."


1b. Or, if you have hyper-threading enabled on an Intel CPU, you may think you have e.g. 8 cores, but if all the threads are running the same type of operation repeatedly, it won't be possible for the hyper-threading to work well and you'll only get 4 in practice. Maybe less due to overheads. Or, if your work is continually going to main memory for data (e.g. limited by the memory bus), it will run at 4-core speed, because the cores have to share the same memory bus.


Hyper-threading depends on the 2 logical cores being asked to perform two different types of tasks at once (each having relatively lower demands on memory).

"When execution resources would not be used by the current task in a processor without hyper-threading, and especially when the processor is stalled, a hyper-threading equipped processor can use those execution resources to execute another scheduled task."
https://en.wikipedia.org/wiki/Hyper-threading
https://en.wikipedia.org/wiki/Superscalar


2. Keep in mind also when benchmarking that it's normal to see a small drop-off when you hit the maximum number of cores for your system.
After all, the O/S, the benchmark program, and anything else you have running will need a core or two.
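As a quick sanity check before interpreting scaling numbers, it can help to compare logical vs physical core counts (a Linux sketch; lscpu's exact field names vary between versions):

```shell
# Logical CPUs visible to the OS, including hyper-threads (GNU coreutils).
nproc
# Physical topology, where lscpu is available: sockets, cores, threads.
lscpu 2>/dev/null | grep -E '^(Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core)' || true
```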





On 2015-07-08 23:38:38 -0400, Tom Lane wrote:
> andres@anarazel.de (Andres Freund) writes:
> > On 2015-07-08 15:38:24 -0700, Craig James wrote:
> >> From my admittedly naive point of view, it's hard to see why any of this
> >> matters. I have functions that do purely CPU-intensive mathematical
> >> calculations ... you could imagine something like is_prime(N) that
> >> determines if N is a prime number. I have eight clients that connect to
> >> eight backends. Each client issues an SQL command like, "select
> >> is_prime(N)" where N is a simple number.
>
> > I mostly replied to Merlin's general point (additionally in the context of
> > plpgsql).
>
> > But I have a hard time seing that postgres would be the bottleneck for a
> > is_prime() function (or something with similar characteristics) that's
> > written in C where the average runtime is more than, say, a couple
> > thousand cyles.  I'd like to see a profile of that.
>
> But that was not the case that Graeme was complaining about.

No, Craig was complaining about that case...

> One of my Salesforce colleagues has been looking into ways that we could
> decide to skip the per-statement snapshot acquisition even in volatile
> functions, if we could be sure that a particular statement isn't going to
> do anything that would need a snapshot.

Yea, I actually commented about that on IRC as well.

I was thinking about actually continuing to get a snapshot, but mark it
as 'complete on usage'. I.e. only call GetSnapshotData() when the
snapshot is used to decide about visibility.  We probably can't do that
in the toplevel visibility case because it'll probably have noticeable
semantic effects, but ISTM it should be doable for the volatile function
using spi case.


On 09 Jul 2015, at 05:38, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> If you
> write your is_prime function purely in plpgsql, and don't bother to mark
> it nonvolatile, *it will not scale*.

> much for properly written plpgsql; but there's an awful lot of bad plpgsql
> code out there, and it can make a huge difference for that.


Hi Tom,

I object to phrases like 'don't bother to mark it' and 'bad plpgsql' here. That is putting the blame on programmers. Clearly, if there is no end of code out there that isn't right in this regard, there's something very wrong in the project documentation.

1. I have been writing pl/pgsql on and off for a couple of years now and I've read quite a bit of the postgres documentation, but I don't recall seeing a clear statement telling me I should mark pl/pgsql functions nonvolatile wherever possible or throw all performance and scalability out the window. I'm sure there may be a line hidden somewhere in the docs, but judging from the impact it has in practice, this seems like a very fundamental concept that should be repeatedly and clearly marked in the docs.

2. Furthermore, I have never come across anything in the documentation that made it clear to me that any pl/pgsql function I write will, by default, be taking out locks for every single statement in the code. I've written code in, I dunno, maybe 15-20 different languages in my life, and I can't think of another language offhand that does that by default. From the reactions on this thread to this benchmark and the par_psql benchmarks, it doesn't seem that it was even immediately obvious to many postgres enthusiasts and developers.

3. I don't disagree that the benchmark code is objectively 'bad' in the sense that it is missing an important optimisation.

But I really don't think it helps to frame this as laziness or "bad" in any other sense of the word e.g. 'clumsy'.

Let's look at the postgresql documentation for some examples of 'bad' and lazy code:

http://www.postgresql.org/docs/9.3/static/plpgsql-structure.html
http://www.postgresql.org/docs/9.3/static/plpgsql-declarations.html

There are about 13 functions on that page.
How many functions on that page make use of non-volatile or immutable wherever it would be appropriate?
zero.

or this one: http://www.postgresql.org/docs/9.3/static/plpgsql-control-structures.html
zero

or this one: http://www.postgresql.org/docs/9.3/static/plpgsql-cursors.html#PLPGSQL-CURSOR-USING
zero

The reason 90% of people out there are 'not bothering' and 'writing bad code' is because **99% of the postgresql documentation teaches them to do it that way**.

So when you talk about other people 'not bothering' to do things - who is really at fault here for what you see as endemic 'bad' or 'lazy' code? Is it the new postgres programmers, or the people that taught them with "bad" examples consistently throughout the *entire project documentation*, starting from the very first example?

I think I'm going to raise this as a documentation bug.

Graeme.









On Wed, Jul 8, 2015 at 5:38 PM, Craig James <cjames@emolecules.com> wrote:
> On Wed, Jul 8, 2015 at 1:27 PM, Andres Freund <andres@anarazel.de> wrote:
>>
>> On 2015-07-08 13:46:53 -0500, Merlin Moncure wrote:
>> > On Wed, Jul 8, 2015 at 12:48 PM, Craig James <cjames@emolecules.com>
>> > wrote:
>> > > On Tue, Jul 7, 2015 at 10:31 PM, Joshua D. Drake
>> > > <jd@commandprompt.com>
>> > >> Using Apache Fast-CGI, you are going to fork a process for each
>> > >> instance
>> > >> of the function being executed and that in turn will use all CPUs up
>> > >> to the
>> > >> max available resource.
>> > >>
>> > >> With PostgreSQL, that isn't going to happen unless you are running
>> > >> (at
>> > >> least) 8 functions across 8 connections.
>> > >
>> > >
>> > > Well, right, which is why I mentioned "even with dozens of clients."
>> > > Shouldn't that scale to at least all of the CPUs in use if the
>> > > function is
>> > > CPU intensive (which it is)?
>> >
>> > only in the absence of inter-process locking and cache line bouncing.
>>
>> And addititionally memory bandwidth (shared between everything, even in
>> the numa case), cross socket/bus bandwidth (absolutely performance
>> critical in multi-socket configurations), cache capacity (shared between
>> cores, and sometimes even sockets!).
>
>
> From my admittedly naive point of view, it's hard to see why any of this
> matters. I have functions that do purely CPU-intensive mathematical
> calculations ... you could imagine something like is_prime(N) that
> determines if N is a prime number. I have eight clients that connect to
> eight backends. Each client issues an SQL command like, "select is_prime(N)"
> where N is a simple number.
>
> Are you saying that in order to calculate is_prime(N), all of that stuff
> (inter-process locking, memory bandwith, bus bandwidth, cache capacity,
> etc.) is even relevant? And if so, how is it that Postgres is so different
> from an Apache fast-CGI program that runs the exact same is_prime(N)
> calculation?
>
> Just curious ... as I said, I've already implemented a different solution.

If your is_prime() was written in C and was written so that it did not
utilize the database API, it should scale up quite nicely.  This can
be easily proved.  On my quad core workstation,

postgres=# select 12345! * 0;
 ?column?
──────────
        0
(1 row)

Time: 10435.554 ms


...which is heavily cpu bound, takes about 10 seconds.  scaling out to
4 threads via:

time ~/pgdev/bin/pgbench -n -t1 -c4 -f <(echo "select 12345! * 0;")

yields:
real 0m11.317s
user 0m0.001s
sys 0m0.005s

...I'll call that pretty good scaling.  The reason why this scales so
well is that the numeric code is all operating on local data
structures and is not involving backend componentry with its various
attached complexity, such as having to be checked for being visible to
the current transaction.

I submit that toy benchmarks like factoring or pi digits are not
really good indicators of language scaling and performance because
just about all real world code involves data structures, i/o, memory
allocation, amateur coders, etc.   Java tends to approach C in
benchmark shootouts but woefully underperforms my expectations
relative to C in code that does things that are actually useful (aside:
if you think I'm knocking java, the situation is even worse with most
other languages I come across).

pl/pgsql is simply not optimized for that style of coding although if
you know postgres you can start to tickle the limits of what's
expected from the language.  If that isn't working for you, pl/v8
strikes me as the best alternative due to its performance and good
integration with postgres data structures (in fact, I'd be arguing for
it to be moved to core if the v8 dependency wasn't so capricious).
Either way, I'll advocate any solution that allows you to code inside
the database environment as opposed to the client side.

merlin


On Thu, Jul 9, 2015 at 4:44 AM, Graeme B. Bell <graeme.bell@nibio.no> wrote:
> On 09 Jul 2015, at 05:38, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> If you
>> write your is_prime function purely in plpgsql, and don't bother to mark
>> it nonvolatile, *it will not scale*.
>
>> much for properly written plpgsql; but there's an awful lot of bad plpgsql
>> code out there, and it can make a huge difference for that.
>
>
> Hi Tom,
>
> I object to phrases like 'don't bother to mark it' and 'bad plpgsql' here. That is putting the blame on programmers.
Clearly,if there is no end of code out there that isn't right in this regard, there's something very wrong in the
projectdocumentation. 
>
> 1. I have been writing pl/pgsql on and off for a couple of years now and I've read quite a bit of the postgres documentation, but I don't recall seeing a clear statement telling me I should mark pl/pgsql functions nonvolatile wherever possible or throw all performance and scalability out the window. I'm sure there may be a line hidden somewhere in the docs, but judging from the impact it has in practice, this seems like a very fundamental concept that should be repeatedly and clearly marked in the docs.
>
> 2. Furthermore, I have never come across anything in the documentation that made it clear to me that any pl/pgsql
> function I write will, by default, be taking out locks for every single statement in the code. I've written code in I
> dunno, maybe 15-20 different languages in my life, and I can't think of another language offhand that does that by
> default. From the reactions on this thread to this benchmark and the par_psql benchmarks, it doesn't seem that it was
> even immediately obvious to many postgres enthusiasts and developers.
>
> 3. I don't disagree that the benchmark code is objectively 'bad' in the sense that it is missing an important optimisation.

Particularly with regard to the documentation, a patch improving things is
much more likely to improve the situation than griping.  Also,
conversation on this list gets recorded for posterity and google is
remarkably good at matching people looking for problems with
solutions.  So, even in absence of a patch perhaps we've made the
lives of future head-scratchers a little bit easier with this
discussion.

merlin


Graeme B. Bell schrieb am 09.07.2015 um 11:44:
> I don't recall seeing a clear statement telling me I should mark pl/pgsql
> functions nonvolatile wherever possible or throw all performance and
> scalability out the window.

From: http://www.postgresql.org/docs/current/static/xfunc-volatility.html

   "For best optimization results, you should label your functions
    with the strictest volatility category that is valid for them."
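
In practice, that advice looks something like the sketch below (the function
name and body are hypothetical, in the spirit of the is_prime example earlier
in the thread). Left unlabelled, a pl/pgsql function defaults to VOLATILE,
and a volatile function obtains a fresh snapshot for each statement it
executes, which under heavy concurrency serialises on shared state; a STABLE
or IMMUTABLE function can instead reuse the calling query's snapshot:

```sql
-- Hypothetical example. Without a volatility label this function would
-- default to VOLATILE. Since it has no side effects and touches no
-- tables, IMMUTABLE is the strictest category that is valid for it
-- (use STABLE for functions that read, but don't modify, tables).
CREATE OR REPLACE FUNCTION is_prime(n bigint) RETURNS boolean AS $$
DECLARE
    i bigint := 2;
BEGIN
    IF n < 2 THEN
        RETURN false;
    END IF;
    WHILE i * i <= n LOOP
        IF n % i = 0 THEN
            RETURN false;
        END IF;
        i := i + 1;
    END LOOP;
    RETURN true;
END;
$$ LANGUAGE plpgsql IMMUTABLE;
```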



On 09 Jul 2015, at 15:22, Thomas Kellerer <spam_eater@gmx.net> wrote:

> Graeme B. Bell schrieb am 09.07.2015 um 11:44:
>> I don't recall seeing a clear statement telling me I should mark pl/pgsql
>> functions nonvolatile wherever possible or throw all performance and
>> scalability out the window.
>
> From: http://www.postgresql.org/docs/current/static/xfunc-volatility.html
>
>   "For best optimization results, you should label your functions
>    with the strictest volatility category that is valid for them."


Hi Thomas,

Thank you very much for the link.

However, the point I was making wasn't that no sentence exists anywhere. My point was that I've read the docs more than
anyone else in my institute and I was completely unaware of this.

It's also quite vague - if you hand that to a younger programmer in particular, how do they implement it in practice?
When is it important to do it? If this one factor silently breaks multiprocessor scaling of pl/pgsql, and
multiprocessing is the biggest trend in CPU processing of the last decade (comparing server CPUs of 2005 with 2015),
then why is this information not up front and clear?


A second point to keep in mind is that optimization and parallelisation/scalability are not always the same thing.

For example, in one project I took a bunch of looped parallel UPDATEs on a set of 50 tables, and rewrote them so as to
run the loop all at once inside a pl/pgsql function. Crudely, I took out the table-level for loop and put it at
row-level instead.

I expected they'd execute much faster if UPDATEs were using data still in cache. Also, I would be updating without
writing out WAL entries to disk repeatedly.

It turns out the update per row ran much faster - as expected - when I used one table, but when I ran it in parallel on
many tables, the performance was even worse than when I started. If you look at the benchmarks, you'll see that
performance drops through the floor at 8-16 cores. I think that was when I first noticed this bug/feature.

[If anyone is curious, the way I solved that one in the end was to pre-calculate every possible way the tables might be
updated after N loops of updates using Python, and import that as a lookup table into PG. It turns out that although we
had tens of GB of data per table, there were only about 100,000 different types of situation, and only e.g. 80 iterations
to consider. Then I ran a single set of UPDATEs with no pl/pgsql. It was something like a 10000x performance
improvement.]
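
The shape of that trick is roughly the following (a hypothetical sketch; the
update rule and state space are made up, since the real ones aren't shown):
if the per-row update is a deterministic function of the row's current state,
you can iterate it offline for every distinct state, then apply the result in
the database with one join instead of N looped UPDATEs.

```python
# Hypothetical sketch: precompute where each distinct row state ends up
# after n_iterations of a deterministic per-row update rule, so that the
# database only needs a single lookup-table join instead of N UPDATE loops.

def update_rule(state):
    # Stand-in for the real per-row UPDATE logic (must be deterministic).
    return (state * 3 + 1) % 100

def build_lookup(states, n_iterations):
    """Map each initial state to its state after n_iterations updates."""
    lookup = {}
    for s in states:
        v = s
        for _ in range(n_iterations):
            v = update_rule(v)
        lookup[s] = v
    return lookup

# ~100,000 distinct situations x ~80 iterations is trivial work in memory.
table = build_lookup(range(100), 80)

# The dict can then be COPYed into a two-column table and applied with
# something like:
#   UPDATE big_table b SET state = l.final_state
#   FROM lookup l WHERE b.state = l.initial_state;
```

The key property being exploited is that the lookup table grows with the
number of *distinct* states, not the number of rows.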

Graeme.

>>
>> 3. I don't disagree that the benchmark code is objectively 'bad' in the sense that it is missing an important optimisation.
>
> Particularly with regard to the documentation, a patch improving things is
> much more likely to improve the situation than griping.  Also,
> conversation on this list gets recorded for posterity and google is
> remarkably good at matching people looking for problems with
> solutions.  So, even in absence of a patch perhaps we've made the
> lives of future head-scratchers a little bit easier with this
> discussion.

I agree that patch > gripe, and about the google aspect. But nonetheless, a well-intentioned gripe is > ignorance of a
problem.

As mentioned earlier, I'm sick just now and will be back in hospital again tomorrow & Monday, so a patch may be a
little bit much to ask from me here :-) It's a bit much even keeping up with the posts on the thread so far.

I might try to fix the documentation a bit later, though as someone with no experience in marking up volatility on
pl/pgsql functions I doubt my efforts would be that great. I also have other OSS project contributions that need some
attention first.

Re: the google effect. Are these mailing list archives mirrored anywhere, incidentally? For example, I notice we just
lost http://reddit.com/r/amd at the weekend; all the discussion of the last few years on that forum is out of reach.

Graeme Bell


On Thu, Jul 9, 2015 at 10:12 AM, Graeme B. Bell <graeme.bell@nibio.no> wrote:
>>>
>>> 3. I don't disagree that the benchmark code is objectively 'bad' in the sense that it is missing an important optimisation.
>>
>> Particularly with regard to the documentation, a patch improving things is
>> much more likely to improve the situation than griping.  Also,
>> conversation on this list gets recorded for posterity and google is
>> remarkably good at matching people looking for problems with
>> solutions.  So, even in absence of a patch perhaps we've made the
>> lives of future head-scratchers a little bit easier with this
>> discussion.
>
> I agree that patch > gripe, and about the google aspect. But nonetheless, a well-intentioned gripe is > ignorance of a
> problem.
>
> As mentioned earlier, I'm sick just now and will be back in hospital again tomorrow & Monday, so a patch may be a
> little bit much to ask from me here :-) It's a bit much even keeping up with the posts on the thread so far.
>
> I might try to fix the documentation a bit later, though as someone with no experience in marking up volatility on
> pl/pgsql functions I doubt my efforts would be that great. I also have other OSS project contributions that need some
> attention first.
>
> Re: the google effect. Are these mailing list archives mirrored anywhere, incidentally? For example, I notice we just
> lost http://reddit.com/r/amd at the weekend; all the discussion of the last few years on that forum is out of reach.

The community maintains its own mailing list archives on
postgresql.org.  Short of an array of tactical nuclear strikes, this is
going to be preserved, because it represents the history of the project
and in many ways is as important as the source code itself.

The archives are also mirrored by a number of high quality providers
such as nabble (which tend to beat our archives in google rankings --
likely due to the improved interface).

merlin


On 09 Jul 2015, at 17:42, Merlin Moncure <mmoncure@gmail.com> wrote:

> The community maintains its own mailing list archives on
> postgresql.org.  Short of an array of tactical nuclear strikes, this is
> going to be preserved

Good to know, I've seen a lot of dead software projects throughout my life.

But still - we will have to pray that Kim Jong Un never decides to become a MySQL contributor... :)

Graeme.

>>>>> "GBB" == Graeme B Bell <graeme.bell@nibio.no> writes:

GBB> 1a. For example AMD CPUs list the number of integer cores (e.g. 16),
GBB> but there is actually only half as many cores available for floating
GBB> point work (8). So if your functions need to use floating point, your
GBB> scaling will suffer badly on FP functions.

That is half as many 256-bit float units; for scalar math and for
128-bit vector math, each core gets half of the shared float unit.

Only for 256-bit vector math do the schedulers have to compete for
float unit access.

-JimC
--
James Cloos <cloos@jhcloos.com>         OpenPGP: 0x997A9F17ED7DAEA6