Thread: bg worker: patch 1 of 6 - permanent process

bg worker: patch 1 of 6 - permanent process

From
Markus Wanner
Date:
This patch turns the existing autovacuum launcher into an always running
process, partly called the coordinator. If autovacuum is disabled, the
coordinator process still gets started and keeps around, but it doesn't
dispatch vacuum jobs. The coordinator process now uses imessages to
communicate with background (autovacuum) workers and to trigger a vacuum
job. So please apply the imessages patches [1] before any of the bg
worker ones.

It also adds two new controlling GUCs: min_spare_background_workers and
max_spare_background_workers. The autovacuum_max_workers setting still serves
as a limit on the total number of background/autovacuum workers. (It is
going to be renamed in step 4).

Interaction with the postmaster has changed a bit. If autovacuum is
disabled, the coordinator isn't started with
PMSIGNAL_START_AUTOVAC_LAUNCHER anymore, instead there is an
IMSGT_FORCE_VACUUM that any backend might want to send to the
coordinator to prevent data loss due to XID wrap around (see changes in
access/transam/varsup.c). The SIGUSR2 from postmaster to the coordinator
doesn't need to be multiplexed anymore, but is only sent to inform about
fork failures.

A note on the dependency on imessages: for just autovacuum, this message
passing infrastructure isn't absolutely necessary and could be removed.
However, for Postgres-R it turned out to be really helpful and I think
chances are good another user of this background worker infrastructure
would also want to transfer data of varying size to and from these workers.

Just as in the current version of Postgres, the background worker
terminates immediately after having performed a vacuum job.


Open issue: if the postmaster fails to fork a new background worker, the
coordinator still waits a whole second after receiving the SIGUSR2
notification signal from the postmaster. That might have been fine with
just autovacuum, but for other jobs, namely changeset application in
Postgres-R, that's not feasible.


[1] dynshmem and imessages patch
http://archives.postgresql.org/message-id/ab0cd52a64e788f4ecb4515d1e6e4691@localhost



Re: bg worker: patch 1 of 6 - permanent process

From
Itagaki Takahiro
Date:
On Tue, Jul 13, 2010 at 11:31 PM, Markus Wanner <markus@bluegap.ch> wrote:
> This patch turns the existing autovacuum launcher into an always running
> process, partly called the coordinator. If autovacuum is disabled, the
> coordinator process still gets started and keeps around, but it doesn't
> dispatch vacuum jobs.

I think this part is a reasonable proposal, but...

> The coordinator process now uses imessages to communicate with background
> (autovacuum) workers and to trigger a vacuum job.
> It also adds two new controlling GUCs: min_spare_background_workers and
> max_spare_background_workers.

The other changes in the patch don't all seem to be needed for that purpose.
In other words, the patch is not minimal.
The original purpose could be done without IMessage.
Also, min/max_spare_background_workers are not used in the patch at all.
(BTW, min/max_spare_background_*helpers* in postgresql.conf.sample is
maybe a typo.)

The most questionable point for me is why you didn't add any hook functions
in the coordinator process. With the patch, you can extend the coordinator
protocols with IMessage, but it requires patches to core at handle_imessage().
If you want fully-extensible workers, we should provide a method to develop
worker codes in an external plugin.

Is it possible to develop your own codes in the plugin? If possible, you can
use IMessage as a private protocol freely in the plugin. Am I missing something?
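
To make the idea concrete, here is a minimal sketch of such a hook-style
extension point, in the spirit of existing hooks like shmem_startup_hook.
The coordinator_message_hook name and the IMessage struct below are
hypothetical stand-ins for illustration, not code from the patch or from core:

/*
 * Illustrative sketch only: a hook-style extension point of the kind
 * suggested above.  Neither coordinator_message_hook nor this IMessage
 * struct exist in the patch or in core; the names are hypothetical.
 */
#include <stdio.h>

typedef struct IMessage          /* hypothetical message header */
{
    int         msg_type;
    int         sender_backend_id;
} IMessage;

/* hook slot a plugin could fill in from its _PG_init() */
typedef void (*coordinator_message_hook_type) (IMessage *msg);
static coordinator_message_hook_type coordinator_message_hook = NULL;

static void
handle_imessage(IMessage *msg)
{
    /* core would handle its own message types here ... */

    /* ... and hand everything else to the plugin, if one is loaded */
    if (coordinator_message_hook)
        coordinator_message_hook(msg);
}

/* what a plugin's handler might look like */
static void
my_plugin_handler(IMessage *msg)
{
    printf("plugin got message type %d from backend %d\n",
           msg->msg_type, msg->sender_backend_id);
}

int
main(void)
{
    IMessage    msg = {42, 7};

    coordinator_message_hook = my_plugin_handler;   /* plugin registration */
    handle_imessage(&msg);
    return 0;
}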

--
Itagaki Takahiro


Re: bg worker: patch 1 of 6 - permanent process

From
Robert Haas
Date:
On Wed, Aug 25, 2010 at 9:39 PM, Itagaki Takahiro
<itagaki.takahiro@gmail.com> wrote:
> On Tue, Jul 13, 2010 at 11:31 PM, Markus Wanner <markus@bluegap.ch> wrote:
>> This patch turns the existing autovacuum launcher into an always running
>> process, partly called the coordinator. If autovacuum is disabled, the
>> coordinator process still gets started and keeps around, but it doesn't
>> dispatch vacuum jobs.
>
> I think this part is a reasonable proposal, but...

It's not clear to me whether it's better to have a single coordinator
process that handles both autovacuum and other things, or whether it's
better to have two separate processes.

>> The coordinator process now uses imessages to communicate with background
>> (autovacuum) workers and to trigger a vacuum job.
>> It also adds two new controlling GUCs: min_spare_background_workers and
>> max_spare_background_workers.
>
> The other changes in the patch don't all seem to be needed for that purpose.
> In other words, the patch is not minimal.
> The original purpose could be done without IMessage.
> Also, min/max_spare_background_workers are not used in the patch at all.
> (BTW, min/max_spare_background_*helpers* in postgresql.conf.sample is
> maybe a typo.)
>
> The most questionable point for me is why you didn't add any hook functions
> in the coordinator process. With the patch, you can extend the coordinator
> protocols with IMessage, but it requires patches to core at handle_imessage().
> If you want fully-extensible workers, we should provide a method to develop
> worker codes in an external plugin.

I agree with this criticism, but the other thing that strikes me as a
nonstarter is having the postmaster participate in the imessages
framework.  Our general rule is that the postmaster must avoid
touching shared memory; else a backend that scribbles on shared memory
might take out the postmaster, leading to a failure of the
crash-and-restart logic.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: bg worker: patch 1 of 6 - permanent process

From
Itagaki Takahiro
Date:
On Thu, Aug 26, 2010 at 11:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Jul 13, 2010 at 11:31 PM, Markus Wanner <markus@bluegap.ch> wrote:
>>> This patch turns the existing autovacuum launcher into an always running
>>> process, partly called the coordinator.
>
> It's not clear to me whether it's better to have a single coordinator
> process that handles both autovacuum and other things, or whether it's
> better to have two separate processes.

Ah, we can separate the proposal into two topics:
A. Support to run non-vacuum jobs from the autovacuum launcher
B. Support "user-defined background processes"

A was proposed in the original "1 of 6" patch, but B might be more general.
If we have a separated coordinator, B will be required.

Markus, do you need B? Or A + standard backend processes are enough?
If you need B eventually, starting with B might be better.

-- 
Itagaki Takahiro


Re: bg worker: patch 1 of 6 - permanent process

From
Markus Wanner
Date:
Hi,

thanks for your feedback on this, it sort of got lost below the 
discussion about the dynamic shared memory stuff, IMO.

On 08/26/2010 04:39 AM, Robert Haas wrote:
> It's not clear to me whether it's better to have a single coordinator
> process that handles both autovacuum and other things, or whether it's
> better to have two separate processes.

It has been proposed by Alvaro and/or Tom (IIRC) to reduce code 
duplication. Compared to the former approach, it certainly seems cleaner 
that way and it has helped reduce duplicate code a lot.

I'm envisioning such a coordinator process to handle coordination of 
other background processes as well, for example for distributed and/or 
parallel querying.

Having just one process reduces the amount of interaction required 
between coordinators (i.e. autovacuum shouldn't ever start on databases 
for which replication hasn't started yet, as the autovacuum worker would 
be unable to connect to the database at that stage). It also reduces the 
number of extra processes required, and thus I think also the overall 
complexity.

What'd be the benefits of having separate coordinator processes? They'd 
be doing pretty much the same: coordinate background processes. (And 
yes, I clearly consider autovacuum to be just one kind of background 
process).

> I agree with this criticism, but the other thing that strikes me as a
> nonstarter is having the postmaster participate in the imessages
> framework.

This is simply not the case (anymore). (And one of the reasons a 
separate coordinator process is required, instead of letting the 
postmaster do this kind of coordination).

> Our general rule is that the postmaster must avoid
> touching shared memory; else a backend that scribbles on shared memory
> might take out the postmaster, leading to a failure of the
> crash-and-restart logic.

That rule is well understood and followed by the bg worker 
infrastructure patches. If you find code for which that isn't true, 
please point at it. The crash-and-restart logic should work just as it 
did with the autovacuum launcher.

Regards

Markus


Re: bg worker: patch 1 of 6 - permanent process

From
Markus Wanner
Date:
Itagaki-san,

thanks for reviewing this.

On 08/26/2010 03:39 AM, Itagaki Takahiro wrote:
> The other changes in the patch don't all seem to be needed for that purpose.
> In other words, the patch is not minimal.

Hm.. yeah, maybe the separation between step1 and step2 is a bit 
arbitrary. I'll look into it.

> The original purpose could be done without IMessage.

Agreed, that's the one exception. I've mentioned why that is and I don't 
currently feel like coding an unneeded variant which doesn't use imessages.

> Also, min/max_spare_background_workers are not used in the patch at all.

You are right, it only starts to get used in step2, so the addition 
should probably move there, right.

> (BTW, min/max_spare_background_*helpers* in postgresql.conf.sample is
> maybe a typo.)

Uh, correct, thank you for pointing this out. (I had originally named 
these processes *helpers*. After merging with autovacuum, it made more 
sense to name them background *workers*.)

> The most questionable point for me is why you didn't add any hook functions
> in the coordinator process.

Because I'm a hook-hater. ;-)

No, seriously: I don't see what problem hooks could have solved. I'm 
coding in C and extending the Postgres code. Deciding for hooks and an 
API to use them requires good knowledge of where exactly you want to 
hook and what API you want to provide. Then that API needs to remain 
stable for an extended time. I don't think any part of the bg worker 
infrastructure currently is anywhere close to that.

> With the patch, you can extend the coordinator
> protocols with IMessage, but it requires patches to core at handle_imessage().
> If you want fully-extensible workers, we should provide a method to develop
> worker codes in an external plugin.

It's originally intended as internal infrastructure. Offering its 
capabilities to the outside would require stabilization, security 
control and working out an API. All of which is certainly not something 
I intend to do.

> Is it possible to develop your own codes in the plugin? If possible, you can
> use IMessage as a private protocol freely in the plugin. Am I missing something?

Well, what problem(s) are you trying to solve with such a thing? I've no 
idea what direction you are aiming at, sorry. However, it's certainly 
different from, or an extension of, bg worker, so it would need to be a separate 
patch, IMO.

Regards

Markus Wanner


Re: bg worker: patch 1 of 6 - permanent process

From
Markus Wanner
Date:
On 08/26/2010 05:01 AM, Itagaki Takahiro wrote:
> Markus, do you need B? Or A + standard backend processes are enough?
> If you need B eventually, starting with B might be better.

No, I certainly don't need B.

Why not just use an ordinary backend to do "user defined background 
processing"? It covers all of the API stability and the security issues 
I've raised.

Regards

Markus Wanner


Re: bg worker: patch 1 of 6 - permanent process

From
Itagaki Takahiro
Date:
On Thu, Aug 26, 2010 at 7:42 PM, Markus Wanner <markus@bluegap.ch> wrote:
>> Markus, do you need B? Or A + standard backend processes are enough?
>
> No, I certainly don't need B.

OK, I see why you proposed coordinator hook (yeah, I call it hook :)
rather than adding user-defined processes.

> Why not just use an ordinary backend to do "user defined background
> processing"? It covers all of the API stability and the security issues I've
> raised.

However, we have autovacuum worker processes in addition to normal backend
processes. Doesn't that show that there are some jobs we cannot run in
normal backends?
For example, normal backends cannot do anything in idle time, so a
time-based polling job is difficult in backends. It might be ok to
fork processes for each interval when the polling interval is long,
but it is not effective for short interval cases.  I'd like to use
such kind of process as an additional stats collector.

-- 
Itagaki Takahiro


Re: bg worker: patch 1 of 6 - permanent process

From
Markus Wanner
Date:
Itagaki-san,

On 08/26/2010 01:02 PM, Itagaki Takahiro wrote:
> OK, I see why you proposed coordinator hook (yeah, I call it hook :)
> rather than adding user-defined processes.

I see. If you call that a hook, I'm definitely not a hook-hater ;-)  at 
least not according to your definition.

> However, we have autovacuum worker processes in addition to normal backend
> processes. Doesn't that show that there are some jobs we cannot run in
> normal backends?

Hm.. understood. You can use VACUUM from a cron job. And that's the 
problem autovacuum solves. So in a way, that's just a convenience 
feature. You want the same for general purpose user defined background 
processing, right?

>   For example, normal backends cannot do anything in idle time, so a
> time-based polling job is difficult in backends. It might be ok to
> fork processes for each interval when the polling interval is long,
> but it is not effective for short interval cases.  I'd like to use
> such kind of process as an additional stats collector.

Did you follow the discussion I had with Dimitri, who was trying 
something similar, IIRC? See the bg worker - overview thread. There 
might be some interesting bits pointing in that direction.

Regards

Markus


Re: bg worker: patch 1 of 6 - permanent process

From
Robert Haas
Date:
On Thu, Aug 26, 2010 at 6:07 AM, Markus Wanner <markus@bluegap.ch> wrote:
> What'd be the benefits of having separate coordinator processes? They'd be
> doing pretty much the same: coordinate background processes. (And yes, I
> clearly consider autovacuum to be just one kind of background process).

I dunno.  It was just a thought.  I haven't actually looked at the
code to see how much synergy there is.  (Sorry, been really busy...)

>> I agree with this criticism, but the other thing that strikes me as a
>> nonstarter is having the postmaster participate in the imessages
>> framework.
>
> This is simply not the case (anymore). (And one of the reasons a separate
> coordinator process is required, instead of letting the postmaster do this
> kind of coordination).

Oh, OK.  I see now that I misinterpreted what you wrote.

On the more general topic of imessages, I had one other thought that
might be worth considering.  Instead of using shared memory, what
about using a file that is shared between the sender and receiver?  So
for example, perhaps each receiver will read messages from a file
called pg_messages/%d, where %d is the backend ID.  And writers will
write into that file.  Perhaps both readers and writers mmap() the
file, or perhaps there's a way to make it work with just read() and
write().  If you actually mmap() the file, you could probably manage
it in a fashion pretty similar to what you had in mind for wamalloc,
or some other setup that minimizes locking.  In particular, ISTM that
if we want this to be usable for parallel query, we'll want to be able
to have one process streaming data in while another process streams
data out, with minimal interference between these two activities.  On
the other hand, for processes that only send and receive messages
occasionally, this might just be overkill (and overhead).  You'd be
just as well off wrapping the access to the file in an LWLock: the
reader takes the lock, reads the data, marks it read, and releases the
lock.  The writer takes the lock, writes data, and releases the lock.
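
To sketch what I mean (the file name, header layout and sizes here are
invented; in reality the reader would be a different process, and access
would be wrapped in a lock as just described):

/*
 * Minimal sketch (not from any patch) of a per-backend message file that
 * both sender and receiver mmap().  The path and layout are made up for
 * illustration only.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define QUEUE_SIZE 8192

typedef struct MsgFileHeader
{
    size_t      bytes_written;      /* how much payload follows the header */
} MsgFileHeader;

int
main(void)
{
    const char *path = "pg_messages.42";   /* hypothetical: one file per backend ID */
    int         fd;
    char       *base;
    MsgFileHeader *hdr;
    const char *payload = "VACUUM database 12345";

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, QUEUE_SIZE) < 0)
        return 1;

    base = mmap(NULL, QUEUE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED)
        return 1;

    /* "writer" side: append a message after the header */
    hdr = (MsgFileHeader *) base;
    memcpy(base + sizeof(*hdr), payload, strlen(payload) + 1);
    hdr->bytes_written = strlen(payload) + 1;

    /* "reader" side: consume it (normally a different process) */
    printf("read %zu bytes: %s\n", hdr->bytes_written, base + sizeof(*hdr));

    munmap(base, QUEUE_SIZE);
    close(fd);
    unlink(path);
    return 0;
}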

It almost seems to me that there are two different kinds of messages
here: control messages and data messages.  Control messages are things
like "vacuum this database!" or "flush your cache!" or "execute this
query and send the results to backend %d!" or "cancel the currently
executing query!".  They are relatively small (in some cases,
fixed-size), relatively low-volume, don't need complex locking, and
can generally be processed serially but with high priority.  Data
messages are streams of tuples, either from a remote database from
which we are replicating, or between backends that are executing a
parallel query.  These messages may be very large and extremely
high-volume, are very sensitive to concurrency problems, but are not
high-priority.  We want to process them as quickly as possible, of
course, but the work may get interrupted by control messages.  Another
point is that it's reasonable, at least in the case of parallel query,
for the action of sending a data message to *block*.  If one part of
the query is too far ahead of the rest of the query, we don't want to
queue up results forever, perhaps using CPU or I/O resources that some
other backend needs to catch up, exhausting available disk space, etc.Instead, at some point, we just block and wait
forthe queue to
 
drain.  I suppose there's no avoiding the possibility that sending a
control message might also block, but certainly we wouldn't like a
control message to block because the relevant queue is full of data
messages.
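
A toy illustration of that separation (the queues and messages here are
invented; the only point is that the receiver drains control messages
before touching the data stream):

/*
 * Toy sketch of the two-channel idea, unrelated to any actual patch:
 * control messages and data messages live in separate queues, and the
 * receiver always services pending control messages first.
 */
#include <stdio.h>

#define MAXQ 8

typedef struct Queue
{
    const char *items[MAXQ];
    int         head;
    int         tail;
} Queue;

static void enqueue(Queue *q, const char *msg) { q->items[q->tail++ % MAXQ] = msg; }
static int  empty(const Queue *q)              { return q->head == q->tail; }
static const char *dequeue(Queue *q)           { return q->items[q->head++ % MAXQ]; }

int
main(void)
{
    Queue       control = {{0}, 0, 0};
    Queue       data = {{0}, 0, 0};

    enqueue(&data, "tuple batch 1");
    enqueue(&data, "tuple batch 2");
    enqueue(&control, "cancel the currently executing query!");

    /* receiver loop: control messages are always serviced first */
    while (!empty(&control) || !empty(&data))
    {
        if (!empty(&control))
            printf("control: %s\n", dequeue(&control));
        else
            printf("data:    %s\n", dequeue(&data));
    }
    return 0;
}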

So I kind of wonder whether we ought to have two separate systems, one
for data and one for control, with somewhat different characteristics.
I notice that one of your bg worker patches is for OOO-messages.  I
apologize again for not having read through it, but how much does that
resemble separating the control and data channels?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: bg worker: patch 1 of 6 - permanent process

From
Markus Wanner
Date:
Robert,

On 08/26/2010 02:44 PM, Robert Haas wrote:
> I dunno.  It was just a thought.  I haven't actually looked at the
> code to see how much synergy there is.  (Sorry, been really busy...)

No problem, was just wondering if there's any benefit you had in mind.

> On the more general topic of imessages, I had one other thought that
> might be worth considering.  Instead of using shared memory, what
> about using a file that is shared between the sender and receiver?

What would that buy us? (At the price of more system calls and disk 
I/O)? Remember that the current approach (IIRC) uses exactly one syscall 
to send a message: kill() to send the (multiplexed) signal. (Except on 
strange platforms or setups that don't have a user-space spinlock 
implementation and need to use system mutexes).

> So
> for example, perhaps each receiver will read messages from a file
> called pg_messages/%d, where %d is the backend ID.  And writers will
> write into that file.  Perhaps both readers and writers mmap() the
> file, or perhaps there's a way to make it work with just read() and
> write().  If you actually mmap() the file, you could probably manage
> it in a fashion pretty similar to what you had in mind for wamalloc,
> or some other setup that minimizes locking.

That would still require proper locking, then. So I'm not seeing the 
benefit.

> In particular, ISTM that
> if we want this to be usable for parallel query, we'll want to be able
> to have one process streaming data in while another process streams
> data out, with minimal interference between these two activities.

That's well possible with the current approach. About the only 
limitation is that a receiver can only consume the messages in the order 
they got into the queue. But pretty much any backend can send messages 
to any other backend concurrently.

(Well, except that I think there currently are bugs in wamalloc).

> On
> the other hand, for processes that only send and receive messages
> occasionally, this might just be overkill (and overhead).  You'd be
> just as well off wrapping the access to the file in an LWLock: the
> reader takes the lock, reads the data, marks it read, and releases the
> lock.  The writer takes the lock, writes data, and releases the lock.

The current approach uses plain spinlocks, which are more efficient. 
Note that both, appending as well as removing from the queue are writing 
operations, from the point of view of the queue. So I don't think 
LWLocks buy you anything here, either.

> It almost seems to me that there are two different kinds of messages
> here: control messages and data messages.  Control messages are things
> like "vacuum this database!" or "flush your cache!" or "execute this
> query and send the results to backend %d!" or "cancel the currently
> executing query!".  They are relatively small (in some cases,
> fixed-size), relatively low-volume, don't need complex locking, and
> can generally be processed serially but with high priority.  Data
> messages are streams of tuples, either from a remote database from
> which we are replicating, or between backends that are executing a
> parallel query.  These messages may be very large and extremely
> high-volume, are very sensitive to concurrency problems, but are not
> high-priority.  We want to process them as quickly as possible, of
> course, but the work may get interrupted by control messages.  Another
> point is that it's reasonable, at least in the case of parallel query,
> for the action of sending a data message to *block*.  If one part of
> the query is too far ahead of the rest of the query, we don't want to
> queue up results forever, perhaps using CPU or I/O resources that some
> other backend needs to catch up, exhausting available disk space, etc.

I agree that such a thing isn't currently covered. And it might be 
useful. However, adding two separate queues with different priority 
would be very simple to do. (Note, however, that there already are the 
standard unix signals for very simple kinds of control signals. I.e. for 
aborting a parallel query, you could simply send SIGINT to all 
background workers involved).

I understand the need to limit the amount of data in flight, but I don't 
think that sending any type of message should ever block. Messages are 
atomic in that regard. Either they are ready to be delivered (in 
entirety) or not. Thus the sender needs to hold back the message, if the 
recipient is overloaded. (Also note that currently imessages are bound 
to a maximum size of around 8 KB).

It might be interesting to note that I've just implemented some kind of 
streaming mechanism *atop* of imessages for Postgres-R. A data stream 
gets fragmented into single messages. As you pointed out, there should 
be some kind of congestion control. However, in my case, that needs to 
cover the inter-node connection as well, not just imessages. So I think 
the solution to that problem needs to be found on a higher level. I.e. 
in the Postgres-R case, I want to limit the *overall* amount of recovery 
data that's pending for a certain node. Not just the amount that's 
pending on a certain stream within the imessages system.
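
Roughly, the fragmentation looks like this (a sketch only, with invented
struct and helper names; only the roughly 8 KB payload limit reflects the
actual imessages code):

/*
 * Rough sketch of chopping a change-set stream into messages of at most
 * IMSG_MAX_SIZE bytes, each carrying a stream id and sequence number so
 * the receiver can reassemble them in order.  Illustrative only.
 */
#include <stdio.h>
#include <string.h>

#define IMSG_MAX_SIZE 8192      /* per-message payload limit (approx.) */

typedef struct StreamFragment
{
    int         stream_id;      /* which logical stream this belongs to */
    int         seq_no;         /* position within the stream */
    size_t      len;            /* payload bytes in this fragment */
    char        payload[IMSG_MAX_SIZE];
} StreamFragment;

/* split a buffer into fragments; returns the number of fragments produced */
static int
fragment_stream(int stream_id, const char *buf, size_t total,
                StreamFragment *out, int max_frags)
{
    int         n = 0;
    size_t      off = 0;

    while (off < total && n < max_frags)
    {
        size_t      chunk = total - off;

        if (chunk > IMSG_MAX_SIZE)
            chunk = IMSG_MAX_SIZE;
        out[n].stream_id = stream_id;
        out[n].seq_no = n;
        out[n].len = chunk;
        memcpy(out[n].payload, buf + off, chunk);
        off += chunk;
        n++;
    }
    return n;
}

int
main(void)
{
    static char changeset[20000];           /* pretend change set data */
    static StreamFragment frags[8];
    int         n = fragment_stream(1, changeset, sizeof(changeset), frags, 8);

    for (int i = 0; i < n; i++)
        printf("fragment %d of stream %d: %zu bytes\n",
               frags[i].seq_no, frags[i].stream_id, frags[i].len);
    return 0;
}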

Think of imessages as the IP between processes, while streaming of data 
needs something akin to TCP on top of it. (OTOH, this comparison is 
lacking, because imessages guarantee reliable and ordered delivery of 
messages).

BTW: why do you think the data heavy messages are sensitive to 
concurrency problems? I found the control messages to be rather more 
sensitive, as state changes and timing for those control messages are 
trickier to deal with.

> So I kind of wonder whether we ought to have two separate systems, one
> for data and one for control, with somewhat different characteristics.
> I notice that one of your bg worker patches is for OOO-messages.  I
> apologize again for not having read through it, but how much does that
> resemble separating the control and data channels?

It's something that resides within the coordinator process exclusively 
and doesn't have much to do with imessages. Postgres-R doesn't require 
the GCS to deliver (certain kind of) messages in any order, it only 
requires the GCS to guarantee reliability of message delivery (or 
notification in the form of excluding the failing node from the group in 
case delivery failed).

Thus, the coordinator needs to be able to re-order the messages, because 
bg workers need to receive the change sets in the correct order. And 
imessages guarantees to maintain the ordering.

The reason for doing this within the coordinator is to a) lower 
requirements for the GCS and b) gain more control of the data flow. I.e. 
congestion control gets much easier, if the coordinator knows the amount 
of data that's queued. (As opposed to having lots of TCP connections, 
each of which queues an unknown amount of data).

As is evident, all of these decisions are rather Postgres-R centric. 
However, I still think the simplicity and the level of generalization of 
imessages, dynamic shared memory and to some extent even the background 
worker infrastructure make these components potentially re-usable.

Regards

Markus Wanner


Re: bg worker: patch 1 of 6 - permanent process

From
Tom Lane
Date:
Markus Wanner <markus@bluegap.ch> writes:
> On 08/26/2010 02:44 PM, Robert Haas wrote:
>> On the more general topic of imessages, I had one other thought that
>> might be worth considering.  Instead of using shared memory, what
>> about using a file that is shared between the sender and receiver?

> What would that buy us?

Not having to have a hard limit on the space for unconsumed messages?

> The current approach uses plain spinlocks, which are more efficient. 

Please note the coding rule that says that the code should not execute
more than a few straight-line instructions while holding a spinlock.
If you're copying long messages while holding the lock, I don't think
spinlocks are acceptable.
        regards, tom lane


Re: bg worker: patch 1 of 6 - permanent process

From
Markus Wanner
Date:
On 08/26/2010 09:22 PM, Tom Lane wrote:
> Not having to have a hard limit on the space for unconsumed messages?

Ah, I see. However, spilling to disk is unwanted for the current use 
cases of imessages. Instead the sender needs to be able to deal with 
out-of-(that-specific-part-of-shared)-memory conditions.

> Please note the coding rule that says that the code should not execute
> more than a few straight-line instructions while holding a spinlock.
> If you're copying long messages while holding the lock, I don't think
> spinlocks are acceptable.

Writing the payload data for imessages to shared memory doesn't need any 
kind of lock. (Because the relevant chunk of shared memory got allocated 
via wamalloc, which grants the allocator exclusive control over the 
returned chunk). Only appending and removing (the pointer to the data) 
to and from the queue requires taking a spinlock. And I think that still 
qualifies.
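
To illustrate the pattern (with malloc() standing in for wamalloc() and a
pthread spinlock standing in for a PostgreSQL spinlock; none of this is the
actual Postgres-R code, the names are invented):

/*
 * Sketch of the locking pattern described above: the payload is written
 * without any lock held, and the spinlock is only taken for the few
 * instructions that link the message into (or out of) the queue.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct IMessage
{
    struct IMessage *next;
    size_t      len;
    char        data[];         /* payload follows the header */
} IMessage;

static IMessage *queue_head = NULL;
static IMessage **queue_tail = &queue_head;
static pthread_spinlock_t queue_lock;

static void
imessage_send(const char *payload)
{
    IMessage   *msg;

    /* no lock: the chunk is private to the sender until it is queued */
    msg = malloc(sizeof(IMessage) + strlen(payload) + 1);
    msg->next = NULL;
    msg->len = strlen(payload) + 1;
    memcpy(msg->data, payload, msg->len);

    /* spinlock held only to append the pointer */
    pthread_spin_lock(&queue_lock);
    *queue_tail = msg;
    queue_tail = &msg->next;
    pthread_spin_unlock(&queue_lock);
}

static IMessage *
imessage_receive(void)
{
    IMessage   *msg;

    /* spinlock held only to detach the pointer */
    pthread_spin_lock(&queue_lock);
    msg = queue_head;
    if (msg)
    {
        queue_head = msg->next;
        if (queue_head == NULL)
            queue_tail = &queue_head;
    }
    pthread_spin_unlock(&queue_lock);
    return msg;                 /* payload is read without any lock held */
}

int
main(void)
{
    IMessage   *msg;

    pthread_spin_init(&queue_lock, PTHREAD_PROCESS_PRIVATE);
    imessage_send("vacuum database 12345");
    while ((msg = imessage_receive()) != NULL)
    {
        printf("got: %s\n", msg->data);
        free(msg);
    }
    return 0;
}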

However, your concern is valid for wamalloc, which is more critical in 
that regard.

Regards

Markus


Re: bg worker: patch 1 of 6 - permanent process

From
Robert Haas
Date:
On Thu, Aug 26, 2010 at 3:03 PM, Markus Wanner <markus@bluegap.ch> wrote:
>> On the more general topic of imessages, I had one other thought that
>> might be worth considering.  Instead of using shared memory, what
>> about using a file that is shared between the sender and receiver?
>
> What would that buy us? (At the price of more system calls and disk I/O)?
> Remember that the current approach (IIRC) uses exactly one syscall to send a
> message: kill() to send the (multiplexed) signal. (Except on strange
> platforms or setups that don't have a user-space spinlock implementation and
> need to use system mutexes).

It wouldn't require you to preallocate a big chunk of shared memory
without knowing how much of it you'll actually need.  For example,
suppose we implement parallel query.  If the message queues can be
allocated on the fly, then you can just say
maximum_message_queue_size_per_backend = 16MB and that'll probably be
good enough for most installations.  On systems where parallel query
is not used (e.g. because they only have 1 or 2 processors) then it
costs nothing.  On systems where parallel query is used extensively
(e.g. because they have 32 processors), you'll allocate enough space
for the number of backends that actually need message buffers, and not
more than that.  Furthermore, if parallel query is used at some times
(say, for nightly reporting) but not others (say, for daily OLTP
queries), the buffers can be deallocated when the helper backends exit
(or paged out if they are idle), and that memory can be reclaimed for
other use.

In addition, it means that maximum_message_queue_size_per_backend (or
whatever it's called) can be changed on-the-fly; that is, it can be
PGC_SIGHUP rather than PGC_POSTMASTER.  Being able to change GUCs
without shutting down the postmaster is a *big deal* for people
running in 24x7 operations.  Even things like wal_level that aren't
apt to be changed more than once in a blue moon are a problem (once
you go from "not having a standby" to "having a standby", you're
unlikely to want to go backwards), and this would likely need more
tweaking.  You might find that you need more memory for better
throughput, or that you need to reclaim memory for other purposes.
Especially if it's a hard allocation for any number of backends,
rather than something that backends can allocate only as and when they
need it.

As to efficiency, the process is not much different once the initial
setup is completed.  Just because you write to a memory-mapped file
rather than a shared memory segment doesn't mean that you're
necessarily doing disk I/O.  On systems that support it, you could
also choose to map a named POSIX shm rather than a disk file.  Either
way, there might be a little more overhead at startup but that doesn't
seem so bad; presumably the amount of work that the worker is doing is
large compared to the overhead of a few system calls, or you're
probably in trouble anyway, since our process startup overhead is
pretty substantial already.  The only time it seems like the overhead
would be annoying is if a process is going to use this system, but
only lightly.  Doing the extra setup just to send one or two messages
might suck.  But maybe that just means this isn't the right mechanism
for those cases (e.g. the existing XID-wraparound logic should still
use signal multiplexing rather than this system).  I see the value of
this as being primarily for streaming big chunks of data, not so much
for sending individual, very short messages.

>> On
>> the other hand, for processes that only send and receive messages
>> occasionally, this might just be overkill (and overhead).  You'd be
>> just as well off wrapping the access to the file in an LWLock: the
>> reader takes the lock, reads the data, marks it read, and releases the
>> lock.  The writer takes the lock, writes data, and releases the lock.
>
> The current approach uses plain spinlocks, which are more efficient. Note
> that both, appending as well as removing from the queue are writing
> operations, from the point of view of the queue. So I don't think LWLocks
> buy you anything here, either.

I agree that this might not be useful.  We don't really have all the
message types defined yet, though, so it's hard to say.

> I understand the need to limit the amount of data in flight, but I don't
> think that sending any type of message should ever block. Messages are
> atomic in that regard. Either they are ready to be delivered (in entirety)
> or not. Thus the sender needs to hold back the message, if the recipient is
> overloaded. (Also note that currently imessages are bound to a maximum size
> of around 8 KB).

That's functionally equivalent to blocking, isn't it?  I think that's
just a question of what API you want to expose.

> It might be interesting to note that I've just implemented some kind of
> streaming mechanism *atop* of imessages for Postgres-R. A data stream gets
> fragmented into single messages. As you pointed out, there should be some
> kind of congestion control. However, in my case, that needs to cover the
> inter-node connection as well, not just imessages. So I think the solution
> to that problem needs to be found on a higher level. I.e. in the Postgres-R
> case, I want to limit the *overall* amount of recovery data that's pending
> for a certain node. Not just the amount that's pending on a certain stream
> within the imessages system.

For replication, that might be the case, but for parallel query,
per-queue seems about right.  At any rate, no design we've discussed
will let individual queues grow without bound.

> Think of imessages as the IP between processes, while streaming of data
> needs something akin to TCP on top of it. (OTOH, this comparison is lacking,
> because imessages guarantee reliable and ordered delivery of messages).

You probably need this, but 8KB seems like a pretty small chunk size.
I think one of the advantages of a per-backend area is that you don't
need to worry so much about fragmentation.  If you only need in-order
message delivery, you can just use the whole thing as a big ring
buffer.  There's no padding or sophisticated allocation needed.  You
just need a pointer to the last byte read (P1), the last byte allowed
to be read (P2), and the last byte allocated (P3).  Writers take a
spinlock, advance P3, release the spinlock, write the message, take
the spinlock, advance P2, release the spinlock, and signal the reader.Readers take the spinlock, read P1 and P2,
releasethe spinlock, read 
the data, take the spinlock, advance P1, and release the spinlock.
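
A minimal single-threaded sketch of that scheme (the locking is elided here
because the demo runs in one thread; a real implementation would wrap each
pointer update in the short lock/unlock sequences described above, and would
also have to check that P3 never laps P1):

/*
 * Ring buffer sketch: P1 = last byte read, P2 = last byte allowed to be
 * read, P3 = last byte allocated.  Illustrative only.
 */
#include <stdio.h>
#include <string.h>

#define RING_SIZE 64

static char ring[RING_SIZE];
static size_t p1 = 0;           /* last byte read */
static size_t p2 = 0;           /* last byte allowed to be read */
static size_t p3 = 0;           /* last byte allocated */

/* writer: reserve space (advance P3), copy the message, then publish (advance P2) */
static void
ring_write(const char *msg, size_t len)
{
    size_t      start = p3;

    p3 += len;                          /* reserve; under spinlock in real code */
    for (size_t i = 0; i < len; i++)
        ring[(start + i) % RING_SIZE] = msg[i];
    p2 = p3;                            /* publish; under spinlock, then signal reader */
}

/* reader: consume everything between P1 and P2, then advance P1 */
static void
ring_read(void)
{
    while (p1 < p2)
    {
        putchar(ring[p1 % RING_SIZE]);
        p1++;
    }
}

int
main(void)
{
    ring_write("hello from the writer\n", strlen("hello from the writer\n"));
    ring_write("another message\n", strlen("another message\n"));
    ring_read();
    return 0;
}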

You might still want to fragment chunks of data to avoid problems if,
say, two writers are streaming data to a single reader.  In that case,
if the messages were too large compared to the amount of buffer space
available, you might get poor utilization, or even starvation.  But I
would think you wouldn't need to worry about that until the message
size got fairly high.

> BTW: why do you think the data heavy messages are sensitive to concurrency
> problems? I found the control messages to be rather more sensitive, as state
> changes and timing for those control messages are trickier to deal with.

Well, what I was thinking about is the fact that data messages are
bigger.  If I'm writing a 16-byte message once a minute and the reader
and I block each other until the message is fully read or written,
it's not really that big of a deal.  If the same thing happens when
we're trying to continuously stream tuple data from one process to
another, it halves the throughput; we expect both processes to be
reading/writing almost constantly.

>> So I kind of wonder whether we ought to have two separate systems, one
>> for data and one for control, with somewhat different characteristics.
>> I notice that one of your bg worker patches is for OOO-messages.  I
>> apologize again for not having read through it, but how much does that
>> resemble separating the control and data channels?
>
> It's something that resides within the coordinator process exclusively and
> doesn't have much to do with imessages.

Oh, OK.

> As is evident, all of these decisions are rather Postgres-R centric.
> However, I still think the simplicity and the level of generalization of
> imessages, dynamic shared memory and to some extent even the background
> worker infrastructure make these components potentially re-usable.

I think unicast messaging is really useful and I really want it, but
the requirement that it be done through dynamic shared memory
allocations feels very uncomfortable to me (as you've no doubt
gathered).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: bg worker: patch 1 of 6 - permanent process

From
Robert Haas
Date:
On Thu, Aug 26, 2010 at 3:40 PM, Markus Wanner <markus@bluegap.ch> wrote:
> On 08/26/2010 09:22 PM, Tom Lane wrote:
>>
>> Not having to have a hard limit on the space for unconsumed messages?
>
> Ah, I see. However, spilling to disk is unwanted for the current use cases
> of imessages. Instead the sender needs to be able to deal with
> out-of-(that-specific-part-of-shared)-memory conditions.

Shared memory can be paged out, too, if it's not being used enough to
keep the OS from deciding to evict it.  And I/O to a mmap()'d file or
shared memory region can remain in RAM.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: bg worker: patch 1 of 6 - permanent process

From
Markus Wanner
Date:
Hi,

On 08/26/2010 11:57 PM, Robert Haas wrote:
> It wouldn't require you to preallocate a big chunk of shared memory

Agreed, you wouldn't have to allocate it in advance. We would still want 
a configurable upper limit. So this can be seen as another approach for 
an implementation of a dynamic allocator. (Which should be separate from 
the exact imessages implementation, if only for the sake of modularization, 
IMO).

> In addition, it means that maximum_message_queue_size_per_backend (or
> whatever it's called) can be changed on-the-fly; that is, it can be
> PGC_SIGHUP rather than PGC_POSTMASTER.

That's certainly a point. However, as you are proposing a solution to 
just one subsystem (i.e. imessages), I don't find it half as convincing.

If you are saying it *should* be possible to resize shared memory in a 
portable way, why not do it for *all* subsystems right away? I still 
remember Tom saying it's not something that's doable in a portable way. 
Why and how should it be possible for a per-backend basis? How portable 
is mmap() really? Why don't we use it in Postgres as of now?

I certainly think that these are orthogonal issues: whether to use fixed 
boundaries or to dynamically allocate the memory available is one thing, 
dynamic resizing is another. If the latter is possible, I'm certainly not 
opposed to it. (But would still favor dynamic allocation).

> As to efficiency, the process is not much different once the initial
> setup is completed.

I fully agree to that.

I'm more concerned about ease of use for developers. Simply being able 
to alloc() from shared memory makes things easier than having to invent 
a separate allocation method for every subsystem, again and again (the 
"people are more used to the multi-threaded model" argument).

> Doing the extra setup just to send one or two messages
> might suck.  But maybe that just means this isn't the right mechanism
> for those cases (e.g. the existing XID-wraparound logic should still
> use signal multiplexing rather than this system).  I see the value of
> this as being primarily for streaming big chunks of data, not so much
> for sending individual, very short messages.

I agree that simple signals don't need a full imessage. But as soon as 
you want to send some data (like which database to vacuum), or require 
the delivery guarantee (i.e. no single message gets lost, as opposed to 
signals), then imessages should be cheap enough.

>> The current approach uses plain spinlocks, which are more efficient. Note
>> that both, appending as well as removing from the queue are writing
>> operations, from the point of view of the queue. So I don't think LWLocks
>> buy you anything here, either.
>
> I agree that this might not be useful.  We don't really have all the
> message types defined yet, though, so it's hard to say.

What does the type of lock used have to do with message types? IMO it 
doesn't matter what kind of message or what size you want to send. For 
appending or removing a pointer to or from a message queue, a spinlock 
seems to be just the right thing to use.

>> I understand the need to limit the amount of data in flight, but I don't
>> think that sending any type of message should ever block. Messages are
>> atomic in that regard. Either they are ready to be delivered (in entirety)
>> or not. Thus the sender needs to hold back the message, if the recipient is
>> overloaded. (Also note that currently imessages are bound to a maximum size
>> of around 8 KB).
>
> That's functionally equivalent to blocking, isn't it?  I think that's
> just a question of what API you want to expose.

Hm.. well, yeah, depends on what level you are arguing. The imessages 
API can be used in a completely non-blocking fashion. So a process can 
theoretically do other work while waiting for messages.

For parallel querying, the helper/worker backends would probably need to 
block, if the origin backend is not ready to accept more data, yes. 
However, making it accept and process another job in the mean time seems 
hard to do. But not an imessages problem per se. (While with the above 
streaming layer I've mentioned, that would not be possible, because that 
blocks).

> For replication, that might be the case, but for parallel query,
> per-queue seems about right.  At any rate, no design we've discussed
> will let individual queues grow without bound.

Extend parallel querying to multiple nodes and you are back at the same 
requirement.

However, it's certainly something that can be done atop imessages. I'm 
unsure if doing it as part of imessages is a good thing or not. Given 
the above requirement, I don't currently think so. Using multiple queues 
with different priorities, as you proposed, would probably make it more 
feasible.

> You probably need this, but 8KB seems like a pretty small chunk size.

For node-internal messaging, I probably agree. Would need benchmarking, 
as it's a compromise between latency and overhead, IMO.

I've chosen 8KB so these messages (together with some GCS and other 
transport headers) presumably fit into ethernet jumbo frames. I'd argue 
that you'd want even smaller chunk sizes for 1500 byte MTUs, because I 
don't expect the GCS to do a better job at fragmenting, than we can do 
in the upper layer (i.e. without copying data and w/o additional latency 
when reassembling the packet). But again, maybe that should be 
benchmarked, first.

> I think one of the advantages of a per-backend area is that you don't
> need to worry so much about fragmentation.  If you only need in-order
> message delivery, you can just use the whole thing as a big ring
> buffer.

Hm.. interesting idea. It's similar to my initial implementation, except 
that I had only a single ring-buffer for all backends.

> There's no padding or sophisticated allocation needed.  You
> just need a pointer to the last byte read (P1), the last byte allowed
> to be read (P2), and the last byte allocated (P3).  Writers take a
> spinlock, advance P3, release the spinlock, write the message, take
> the spinlock, advance P2, release the spinlock, and signal the reader.

That would block parallel writers (i.e. only one process can write to 
the queue at any time).

> Readers take the spinlock, read P1 and P2, release the spinlock, read
> the data, take the spinlock, advance P1, and release the spinlock.

It would require copying data in case a process only needs to forward 
the message. That's a quick pointer dequeue and enqueue exercise ATM.

> You might still want to fragment chunks of data to avoid problems if,
> say, two writers are streaming data to a single reader.  In that case,
> if the messages were too large compared to the amount of buffer space
> available, you might get poor utilization, or even starvation.  But I
> would think you wouldn't need to worry about that until the message
> size got fairly high.

Some of the writers in Postgres-R allocate the chunk for the message in 
shared memory way before they send the message. I.e. during a write 
operation of a transaction that needs to be replicated, the backend 
allocates space for a message at the start of the operation, but only 
fills it with change set data during processing. That can possibly take 
quite a while.

Decoupling memory allocation from message queue management makes it possible 
to do this without having to copy the data. The same holds true for forwarding 
a message.

> Well, what I was thinking about is the fact that data messages are
> bigger.  If I'm writing a 16-byte message once a minute and the reader
> and I block each other until the message is fully read or written,
> it's not really that big of a deal.  If the same thing happens when
> we're trying to continuously stream tuple data from one process to
> another, it halves the throughput; we expect both processes to be
> reading/writing almost constantly.

Agreed. Unlike the proposed ring-buffer approach, the separate allocator 
approach doesn't have that problem, because writing itself is fully 
parallelized, even to the same recipient.

> I think unicast messaging is really useful and I really want it, but
> the requirement that it be done through dynamic shared memory
> allocations feels very uncomfortable to me (as you've no doubt
> gathered).

Well, I on the other hand am utterly uncomfortable with having a 
separate solution for memory allocation per sub-system (and it 
definitely is an inherent problem to lots of our subsystems). Given the 
ubiquity of dynamic memory allocators, I don't really understand your 
discomfort.

Thanks for discussing, I always enjoy respectful disagreement.

Regards

Markus Wanner


Re: bg worker: patch 1 of 6 - permanent process

From
Robert Haas
Date:
On Fri, Aug 27, 2010 at 2:17 PM, Markus Wanner <markus@bluegap.ch> wrote:
>> In addition, it means that maximum_message_queue_size_per_backend (or
>> whatever it's called) can be changed on-the-fly; that is, it can be
>> PGC_SIGHUP rather than PGC_POSTMASTER.
>
> That's certainly a point. However, as you are proposing a solution to just
> one subsystem (i.e. imessages), I don't find it half as convincing.

What other subsystems are you imagining servicing with a dynamic
allocator?  If there were a big demand for this functionality, we
probably would have been forced to implement it already, but that's
not the case.  We've already discussed the fact that there are massive
problems with using it for something like shared_buffers, which is by
far the largest consumer of shared memory.

> If you are saying it *should* be possible to resize shared memory in a
> portable way, why not do it for *all* subsystems right away? I still
> remember Tom saying it's not something that's doable in a portable way.

I think it would be great if we could bring some more flexibility to
our memory management.  There are really two layers of problems here.
One is resizing the segment itself, and one is resizing structures
within the segment.  As far as I can tell, there is no portable API
that can be used to resize the shm itself.  For so long as that
remains the case, I am of the opinion that any meaningful resizing of
the objects within the shm is basically unworkable.  So we need to
solve that problem first.

There are a couple of possible solutions, which have been discussed
here in the past.  One very appealing option is to use POSIX shm
rather than sysv shm.  AFAICT, it is possible to portably resize a
POSIX shm using ftruncate(), though I am not sure to what extent this
is supported on Windows.  One significant advantage of using POSIX shm
is that the default limits for POSIX shm on many operating systems are
much higher than the corresponding limits for sysv shm; in fact, some
people have expressed the opinion that it might be worth making the
switch for that reason alone, since it is no secret that a default
value of 32MB or less for shared_buffers is not enough to get
reasonable performance on many modern systems.  I believe, however,
that Tom Lane thinks we need to get a bit more out of it than that to
make it worthwhile.  One obstacle to making the switch is that POSIX
shm does not provide a way to fetch the number of processes attached
to the shared memory segment, which is a critical part of our
infrastructure to prevent accidentally running multiple postmasters on
the same data directory at the same time.  Consequently, it seems hard
to see how we can make that switch completely.  At a minimum, we'll
probably need to maintain a small sysv shm for interlock purposes.

OK, so let's suppose we use POSIX shm for most of the shared memory
segment, and keep only our fixed-size data structures in the sysv shm.Then what?  Well, then we can potentially resize
it. Because we are 
using a process-based model, this will require some careful
gymnastics.  Let's say we're growing the shm.  The backend that is
initiating the operation will call ftruncate() and then signal
all of the other backends (using a sinval message or a multiplexed
signal or some such mechanism) to unmap and remap the shared memory
segment.  Any failure to remap the shared memory segment is at least a
FATAL for that backend, and very likely a PANIC, so this had better
not be something we plan to do routinely - for example, we wouldn't
want to do this as a way of adapting to changing load conditions.  It
would probably be acceptable to do it in a situation such as a
postgresql.conf reload, to accommodate a change in the server
parameter that can't otherwise be changed without a restart, since the
worst case scenario is, well, we have to restart anyway.  Once all
that's done, it's safe to start allocating memory from the newly added
portion of the shm.  Conversely, if we want to shrink the shm, the
considerations are similar, but we have to do everything in the
opposite order.  First, we must ensure that the portion of the shm
we're about to release is unused.  Then, we tell all the backends to
unmap and remap it.  Once we've confirmed that they have done so, we
ftruncate() it to the new size.
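
A self-contained sketch of that grow-and-remap dance, with one process
playing both roles (the segment name and sizes are invented; in the
multi-process case every backend would perform the unmap/remap step in
response to a signal, as described above):

/*
 * Sketch: create a POSIX shm object, size it with ftruncate(), mmap() it,
 * then grow it by ftruncate()'ing to the new size and remapping.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
    const char *name = "/pg_demo_shm";      /* illustrative name only */
    size_t      old_size = 4096;
    size_t      new_size = 8192;
    int         fd;
    char       *base;

    fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, old_size) < 0)
        return 1;

    base = mmap(NULL, old_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED)
        return 1;
    strcpy(base, "data written before the resize");

    /* grow the segment: resize the object, then unmap and remap */
    if (ftruncate(fd, new_size) < 0)
        return 1;
    munmap(base, old_size);
    base = mmap(NULL, new_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED)
        return 1;                   /* in a backend this would be FATAL/PANIC */

    printf("still there after remap: %s\n", base);

    munmap(base, new_size);
    close(fd);
    shm_unlink(name);
    return 0;
}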

Next, we have to think about how we're going to resize data structures
within this expandable shm.  Many of these structures are not things
that we can easily move without bringing the system to a halt.  For
example, it's difficult to see how you could change the base address
of shared buffers without ceasing all system activity, at which point
there's not really much advantage over just forcing a restart.
Similarly with LWLocks or the ProcArray.  And if you can't move them,
then how will you grow them if (as will likely be the case) there's
something immediately following them in memory?  One possible solution
is to divide up these data structures into "slabs".  For example, we
might imagine allocating shared_buffers in 1GB chunks.  To make this
work, we'd need to change the memory layout so that each chunk would
include all of the miscellaneous stuff that we need to do bookkeeping
for that chunk, such as the LWLocks and buffer descriptors.  That
doesn't seem completely impossible, but there would be some
performance penalty, because you could no longer index into shared
buffers from a single base offset.  Instead, you'd need to determine
which chunk contains the buffer you want, look up the base address for
that chunk, and then index into the chunk.  Maybe that overhead
wouldn't be significant (or maybe it would); at any rate, it's not
completely free.  There's also the problem of handling the partial
chunk at the end, especially if that happens to be the only chunk.
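
A sketch of the addressing difference (sizes scaled down so the demo actually
runs; 1GB chunks of 8kB buffers would really hold 131072 buffers per chunk,
and all names here are invented, not from any patch):

/*
 * Sketch of the "slab" addressing cost: instead of one base pointer plus
 * an offset, a buffer lookup first picks the chunk and then indexes
 * within it.
 */
#include <stdio.h>
#include <stdlib.h>

#define BLCKSZ            8192
#define BUFFERS_PER_CHUNK 1024      /* ~131072 for real 1GB chunks */

static char *chunk_base[64];        /* base address of each mapped chunk */

/* today: one big array, a single add suffices */
static char *
buffer_addr_flat(char *shared_buffers, int buf_id)
{
    return shared_buffers + (size_t) buf_id * BLCKSZ;
}

/* chunked: an extra division and lookup on every access */
static char *
buffer_addr_chunked(int buf_id)
{
    int         chunk = buf_id / BUFFERS_PER_CHUNK;
    int         offset = buf_id % BUFFERS_PER_CHUNK;

    return chunk_base[chunk] + (size_t) offset * BLCKSZ;
}

int
main(void)
{
    /* stand-ins for real mappings, just to make the demo runnable */
    char       *flat = malloc((size_t) 2 * BUFFERS_PER_CHUNK * BLCKSZ);

    chunk_base[0] = flat;
    chunk_base[1] = flat + (size_t) BUFFERS_PER_CHUNK * BLCKSZ;

    int         buf_id = BUFFERS_PER_CHUNK + 42;    /* lives in the second chunk */

    printf("flat:    %p\n", (void *) buffer_addr_flat(flat, buf_id));
    printf("chunked: %p\n", (void *) buffer_addr_chunked(buf_id));
    free(flat);
    return 0;
}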

I think the problems for other arrays are similar, or more severe.  I
can't see, for example, how you could resize the ProcArray using this
approach.  If you want to deallocate a chunk of shared buffers, it's
not impossible to imagine an algorithm for relocating any dirty
buffers in the segment to be deallocated into the remaining available
space, and then chucking the ones that are not dirty.  It might not be
real cheap, but that's not the same thing as not possible.  On the
other hand, changing the backend ID of a process in flight seems
intractable.  Maybe it's not.  Or maybe there is some other approach
to resizing these data structures that can work, but it's not real
clear to me what it is.

So basically my feeling is that reworking our memory allocation in
general, while possibly worthwhile, is a whole lot of work.  If we
focus on getting imessages done in the most direct fashion possible,
it seems like the sort of things that could get done in six months to
a year.  If we take the approach of reworking our whole approach to
memory allocation first, I think it will take several years.  Assuming
the problems discussed above aren't totally intractable, I'd be in
favor of solving them, because I think we can get some collateral
benefits out of it that would be nice to have.  However, it's
definitely a much larger project.

> Why
> and how should it be possible for a per-backend basis?

If we're designing a completely new subsystem, we have a lot more
design flexibility, because we needn't worry about interactions with
the existing users of shared memory.  Resizing an arena that is only
used for imessages is a lot more straightforward than resizing the
main shared memory arena.  If you can't remap the main shared memory
chunk, you won't be able to properly clean up your state while
exiting, and so a PANIC is forced.  But if you can't remap the
imessages chunk, and particularly if it only contains messages that
were addressed to you, then you should be able to get by with FATAL,
which is certainly a good thing from a system robustness point of
view.  And you might not even need to remap it.  The main reason
(although perhaps not the only reason) that someone would likely want
to vary a global allocation for parallel query or replication is if
they changed from "not using that feature" to "using it", or perhaps
from "using it" to "using it more heavily".  If the allocations are
per-backend and can be made on the fly, that problem goes away.

As long as we keep the shared memory area used for imessages/dynamic
allocation separate from, and independent of, the main shm, we can
still gain many of the same advantages - in particular, not PANICing
if a remap fails, and being able to resize the thing on the fly.
However, I believe that the implementation will be more complex if the
area is not per-backend.  Resizing is almost certainly a necessity in
this case, for the reasons discussed above, and that will have to be
done by having all backends unmap and remap the area in a coordinated
fashion, so it will be more disruptive than unmapping and remapping a
message queue for a single backend, where you only need to worry about
the readers and writers for that particular queue.  Also, you now have
to worry about fragmentation: a simple ring buffer is great if you're
processing messages on a FIFO basis, but when you have multiple
streams of messages with different destinations, it's probably not a
great solution.

> How portable is
> mmap() really? Why don't we use it in Postgres as of now?

I believe that mmap() is very portable, though there are other people
on this list who know more about exotic, crufty platforms than I do.
I discussed the question of why it's not used for our current shared
memory segment above - no nattch interlock.

>> As to efficiency, the process is not much different once the initial
>> setup is completed.
>
> I fully agree to that.
>
> I'm more concerned about ease of use for developers. Simply being able to
> alloc() from shared memory makes things easier than having to invent a
> separate allocation method for every subsystem, again and again (the
> "people are more used to the multi-threaded model" argument).

This goes back to my points further up: what else do you think this
could be used for?  I'm much less optimistic about this being reusable
than you are, and I'd like to hear some concrete examples of other use
cases.

>>> The current approach uses plain spinlocks, which are more efficient. Note
>>> that both, appending as well as removing from the queue are writing
>>> operations, from the point of view of the queue. So I don't think LWLocks
>>> buy you anything here, either.
>>
>> I agree that this might not be useful.  We don't really have all the
>> message types defined yet, though, so it's hard to say.
>
> What does the type of lock used have to do with message types? IMO it
> doesn't matter what kind of message or what size you want to send. For
> appending or removing a pointer to or from a message queue, a spinlock seems
> to be just the right thing to use.

Well, it's certainly nice, if you can make it work.  I haven't really
thought about all the cases, though.  The main advantages of LWLocks
is that you can take them in either shared or exclusive mode, and that
you can hold them for more than a handful of instructions.  If we're
trying to design a really *simple* system for message passing, LWLocks
might be just right.  Take the lock, read or write the message,
release the lock.  But it seems like that's not really the case we're
trying to optimize for, so this may be a dead-end.

>> You probably need this, but 8KB seems like a pretty small chunk size.
>
> For node-internal messaging, I probably agree. Would need benchmarking, as
> it's a compromise between latency and overhead, IMO.
>
> I've chosen 8KB so these messages (together with some GCS and other
> transport headers) presumably fit into ethernet jumbo frames. I'd argue that
> you'd want even smaller chunk sizes for 1500 byte MTUs, because I don't
> expect the GCS to do a better job at fragmenting, than we can do in the
> upper layer (i.e. without copying data and w/o additional latency when
> reassembling the packet). But again, maybe that should be benchmarked,
> first.

Yeah, probably.  I think designing something that works efficiently
over a network is a somewhat different problem than designing
something that works on an individual node, and we probably shouldn't
let the designs influence each other too much.

>> There's no padding or sophisticated allocation needed.  You
>> just need a pointer to the last byte read (P1), the last byte allowed
>> to be read (P2), and the last byte allocated (P3).  Writers take a
>> spinlock, advance P3, release the spinlock, write the message, take
>> the spinlock, advance P2, release the spinlock, and signal the reader.
>
> That would block parallel writers (i.e. only one process can write to the
> queue at any time).

I feel like there's probably some variant of this idea that works
around that problem.  The problem is that when a worker finishes
writing a message, he needs to know whether to advance P2 only over
his own message or also over some subsequent message that has been
fully written in the meantime.  I don't know exactly how to solve that
problem off the top of my head, but it seems like it might be
possible.
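
To make that concrete, here is a minimal standalone sketch of the
single-writer variant described above.  POSIX spinlocks stand in for
s_lock.h, wrap-around handling and the reader signalling are omitted,
and all of the names are invented for this example, so take it as an
illustration only:

#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE 65536                 /* illustrative only */

typedef struct MsgRing
{
    pthread_spinlock_t lock;            /* set up with pthread_spin_init() */
    uint32_t    p1;                     /* last byte read */
    uint32_t    p2;                     /* last byte allowed to be read */
    uint32_t    p3;                     /* last byte allocated */
    char        data[RING_SIZE];
} MsgRing;

/* Writer: reserve space, copy the message in, then publish it. */
void
ring_write(MsgRing *ring, const char *msg, uint32_t len)
{
    uint32_t    start;

    pthread_spin_lock(&ring->lock);
    start = ring->p3;
    ring->p3 += len;                    /* reserve space */
    pthread_spin_unlock(&ring->lock);

    memcpy(ring->data + start, msg, len);       /* copy outside the lock */

    pthread_spin_lock(&ring->lock);
    ring->p2 = start + len;             /* publish; see note below */
    pthread_spin_unlock(&ring->lock);
    /* ... signal the reader here ... */
}

/* Reader: snapshot the readable range, copy it out, then consume it. */
uint32_t
ring_read(MsgRing *ring, char *buf)
{
    uint32_t    start, end;

    pthread_spin_lock(&ring->lock);
    start = ring->p1;
    end = ring->p2;
    pthread_spin_unlock(&ring->lock);

    memcpy(buf, ring->data + start, end - start);

    pthread_spin_lock(&ring->lock);
    ring->p1 = end;                     /* consume */
    pthread_spin_unlock(&ring->lock);

    return end - start;
}

The line that publishes by setting p2 is exactly where the multi-writer
problem bites: a second writer that reserved space later but finished
earlier cannot publish its message without also exposing the first
writer's half-written bytes.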

>> Readers take the spinlock, read P1 and P2, release the spinlock, read
>> the data, take the spinlock, advance P1, and release the spinlock.
>
> It would require copying data in case a process only needs to forward the
> message. That's a quick pointer dequeue and enqueue exercise ATM.

If we need to do that, that's a compelling argument for having a
single messaging area rather than one per backend.  But I'm not sure I
see why we would need that sort of capability.  Why wouldn't you just
arrange for the sender to deliver the message directly to the final
recipient?

>> You might still want to fragment chunks of data to avoid problems if,
>> say, two writers are streaming data to a single reader.  In that case,
>> if the messages were too large compared to the amount of buffer space
>> available, you might get poor utilization, or even starvation.  But I
>> would think you wouldn't need to worry about that until the message
>> size got fairly high.
>
> Some of the writers in Postgres-R allocate the chunk for the message in
> shared memory way before they send the message. I.e. during a write
> operation of a transaction that needs to be replicated, the backend
> allocates space for a message at the start of the operation, but only fills
> it with change set data during processing. That can possibly take quite a
> while.

So, they know in advance how large the message will be but not what
the contents will be?  What are they doing?

>> I think unicast messaging is really useful and I really want it, but
>> the requirement that it be done through dynamic shared memory
>> allocations feels very uncomfortable to me (as you've no doubt
>> gathered).
>
> Well, I on the other hand am utterly uncomfortable with having a separate
> solution for memory allocation per sub-system (and it definitely is an
> inherent problem to lots of our subsystems). Given the ubiquity of dynamic
> memory allocators, I don't really understand your discomfort.

Well, the fact that something is commonly used doesn't mean it's right
for us.  Tabula rasa, we might design the whole system differently,
but changing it now is not to be undertaken lightly.  Hopefully the
above comments shed some light on my concerns.  In short, (1) I don't
want to preallocate a big chunk of memory we might not use, (2) I fear
reducing the overall robustness of the system, and (3) I'm uncertain
what other systems would be able to leverage a dynamic allocator of the
sort you propose.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: bg worker: patch 1 of 6 - permanent process

From
Markus Wanner
Date:
Hi,

On 08/27/2010 10:46 PM, Robert Haas wrote:
> What other subsystems are you imagining servicing with a dynamic
> allocator?  If there were a big demand for this functionality, we
> probably would have been forced to implement it already, but that's
> not the case.  We've already discussed the fact that there are massive
> problems with using it for something like shared_buffers, which is by
> far the largest consumer of shared memory.

Understood. I certainly plan to look into that for a better 
understanding of the problems those pose for dynamically allocated memory.

> I think it would be great if we could bring some more flexibility to
> our memory management.  There are really two layers of problems here.

Full ACK.

> One is resizing the segment itself, and one is resizing structures
> within the segment.  As far as I can tell, there is no portable API
> that can be used to resize the shm itself.  For so long as that
> remains the case, I am of the opinion that any meaningful resizing of
> the objects within the shm is basically unworkable.  So we need to
> solve that problem first.

Why should resizing of the objects within the shmem be unworkable? 
Doesn't my patch(es) prove the exact opposite? Being able to resize 
"objects" within the shm requires some kind of underlying dynamic 
allocation. And I'd rather be in control of that allocator than 
having to deal with two dozen different implementations on different 
OSes and their libraries.

> There are a couple of possible solutions, which have been discussed
> here in the past.

I currently don't have much interest in dynamic resizing. Being able to 
resize the overall amount of shared memory on the fly would be nice, 
sure. But the total amount of RAM in a server changes rather 
infrequently. Being able to use what's available more efficiently is 
what I'm interested in. That doesn't need any kind of additional or 
different OS level support. It's just a matter of making better use of 
what's available - within Postgres itself.

> Next, we have to think about how we're going to resize data structures
> within this expandable shm.

Okay, that's where I'm getting interested.

> Many of these structures are not things
> that we can easily move without bringing the system to a halt.  For
> example, it's difficult to see how you could change the base address
> of shared buffers without ceasing all system activity, at which point
> there's not really much advantage over just forcing a restart.
> Similarly with LWLocks or the ProcArray.

I guess that's what Bruce wanted to point out by saying our data 
structures are mostly "continuous". I.e. not dynamic lists or hash 
tables, but plain simple arrays.

Maybe that's a subjective impression, but I seem to hear complaints 
about their fixed size and inflexibility quite often. Try to imagine the 
flexibility that dynamic lists could give us.

> And if you can't move them,
> then how will you grow them if (as will likely be the case) there's
> something immediately following them in memory.  One possible solution
> is to divide up these data structures into "slabs".  For example, we
> might imagine allocating shared_buffers in 1GB chunks.

Why 1GB and do yet another layer of dynamic allocation within that? The 
buffers are (by default) 8K, so allocate in chunks of 8K. Or a tiny bit 
more for all of the book-keeping stuff.

> To make this
> work, we'd need to change the memory layout so that each chunk would
> include all of the miscellaneous stuff that we need to do bookkeeping
> for that chunk, such as the LWLocks and buffer descriptors.  That
> doesn't seem completely impossible, but there would be some
> performance penalty, because you could no longer index into shared
> buffers from a single base offset.

AFAICT we currently have four fixed-size blocks to manage shared 
buffers: the buffer blocks themselves, the buffer descriptors, the 
strategy status (for the freelist) and the buffer lookup table.

It's not obvious to me how these data structures should perform better 
than a dynamically allocated layout. One could rather argue that 
combining (some of) the bookkeeping stuff with data itself would lead to 
better locality and thus perform better.

> Instead, you'd need to determine
> which chunk contains the buffer you want, look up the base address for
> that chunk, and then index into the chunk.  Maybe that overhead
> wouldn't be significant (or maybe it would); at any rate, it's not
> completely free.  There's also the problem of handling the partial
> chunk at the end, especially if that happens to be the only chunk.

This sounds way too complicated, yes. Use 8K chunks and most of the 
problems vanish.
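
For reference, the chunk-lookup arithmetic under discussion would boil
down to something like the following sketch (the identifiers and the
1GB-chunk figure are illustrative only, not taken from any existing
patch):

#include <stddef.h>

#define BLCKSZ          8192            /* default block size */
#define BUFS_PER_CHUNK  131072          /* 1GB chunk / 8KB blocks */

extern char *chunk_base[];              /* hypothetical per-chunk base addresses */

/*
 * Chunked addressing: find the chunk, look up its base address, then
 * index into it.  The chunk_base[] fetch is the extra indirection whose
 * cost is being debated, compared to a single base pointer plus offset.
 */
static inline char *
buffer_addr_chunked(int buf_id)
{
    int         chunk = buf_id / BUFS_PER_CHUNK;
    int         offset = buf_id % BUFS_PER_CHUNK;

    return chunk_base[chunk] + (size_t) offset * BLCKSZ;
}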

> I think the problems for other arrays are similar, or more severe.  I
> can't see, for example, how you could resize the ProcArray using this
> approach.

Try not to think in terms of resizing, but dynamic allocation. I.e. 
being able to resize ProcArray (and thus being able to alter 
max_connections on the fly) would take a lot more work.

Just using the unoccupied space of the ProcArray for other subsystems 
that need it more urgently could be done much easier. Again, you'd want 
to allocate a single PGPROC at a time.

(And yes, the benefits aren't as significant as for shared_buffers, 
simply because PGPROC doesn't occupy that much memory).

> If you want to deallocate a chunk of shared buffers, it's
> not impossible to imagine an algorithm for relocating any dirty
> buffers in the segment to be deallocated into the remaining available
> space, and then chucking the ones that are not dirty.

Please use the dynamic allocator for that. Don't duplicate that again. 
Those allocators are designed for efficiently allocating small chunks, 
down to a few bytes.

> It might not be
> real cheap, but that's not the same thing as not possible.  On the
> other hand, changing the backend ID of a process in flight seems
> intractable.  Maybe it's not.  Or maybe there is some other approach
> to resizing these data structures that can work, but it's not real
> clear to me what it is.

Changing to a dynamically allocated memory model certainly requires some 
thought and lots of work. Yes. It's not for free.

> So basically my feeling is that reworking our memory allocation in
> general, while possibly worthwhile, is a whole lot of work.

Exactly.

> If we
> focus on getting imessages done in the most direct fashion possible,
> it seems like the sort of things that could get done in six months to
> a year.

Well, it works for Postgres-R as it is, so imessages already exists 
without needing a single additional month of work. And I don't intend to 
change it back to something that couldn't use a dynamic allocator. I 
already ran into too many problems that way, see below.

> If we take the approach of reworking our whole approach to
> memory allocation first, I think it will take several years.  Assuming
> the problems discussed above aren't totally intractable, I'd be in
> favor of solving them, because I think we can get some collateral
> benefits out of it that would be nice to have.  However, it's
> definitely a much larger project.

Agreed.

> If the allocations are
> per-backend and can be made on the fly, that problem goes away.

That might hold true for imessages, which simply lose importance once 
the (recipient) backend vanishes. But for other shared memory stuff, this 
would rather complicate shared memory access.

> As long as we keep the shared memory area used for imessages/dynamic
> allocation separate from, and independent of, the main shm, we can
> still gain many of the same advantages - in particular, not PANICing
> if a remap fails, and being able to resize the thing on the fly.

Separate sub-system allocators, separate code, separate bugs, lots more 
work. Please not. KISS.

> However, I believe that the implementation will be more complex if the
> area is not per-backend.  Resizing is almost certainly a necessity in
> this case, for the reasons discussed above

I disagree and see the main benefit in making better use of the available 
resources. Resizing will lose much of its importance once you can 
dynamically adjust the boundaries between the subsystems' shares of the 
single, huge, fixed-size shmem chunk allocated at startup.

> and that will have to be
> done by having all backends unmap and remap the area in a coordinated
> fashion,

That's assuming resizing capability.

> so it will be more disruptive than unmapping and remapping a
> message queue for a single backend, where you only need to worry about
> the readers and writers for that particular queue.

And that's assuming a separate allocation method for the imessages 
sub-system.

> Also, you now have
> to worry about fragmentation: a simple ring buffer is great if you're
> processing messages on a FIFO basis, but when you have multiple
> streams of messages with different destinations, it's probably not a
> great solution.

Exactly, that's where dynamic allocation shows its real advantages. No 
silly ring buffers required.

> This goes back to my points further up: what else do you think this
> could be used for?  I'm much less optimistic about this being reusable
> than you are, and I'd like to hear some concrete examples of other use
> cases.

Sure. And well understood. I'll take a try at converting 
shared_buffers.

> Well, it's certainly nice, if you can make it work.  I haven't really
> thought about all the cases, though.  The main advantages of LWLocks
> is that you can take them in either shared or exclusive mode

As mentioned, the message queue has write accesses exclusively (enqueue 
and dequeue), so that's unneeded overhead.

> and that
> you can hold them for more than a handful of instructions.

Neither of the two operations needs more than a handful of instructions, 
so that's plain overhead as well.

> If we're
> trying to design a really *simple* system for message passing, LWLocks
> might be just right.  Take the lock, read or write the message,
> release the lock.

That's exactly how easy it is *with* the dynamic allocator: take the 
(even simpler) spin lock, enqueue (or dequeue) the message, release the 
lock again.

No locking required for writing or reading the message. Independent (and 
well multi-process capable / safe) alloc and free routines for memory 
management, which get called *before* writing the message and *after* 
reading it.

Mixing memory allocation with queue management is a lot more 
complicated to design and understand. And less efficient.
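
In rough, standalone C - with plain malloc() standing in for the shared
memory allocator, POSIX spinlocks for s_lock.h, and names that are made
up for this sketch rather than taken from the actual imessages code -
the flow looks about like this:

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

typedef struct IMsg
{
    struct IMsg *next;
    size_t      size;
    char        payload[];              /* flexible array member */
} IMsg;

typedef struct MsgQueue
{
    pthread_spinlock_t lock;            /* set up with pthread_spin_init() */
    IMsg       *head;
    IMsg       *tail;
} MsgQueue;

/* Allocate and fill a message; no queue lock is held here. */
IMsg *
msg_create(const char *data, size_t size)
{
    IMsg       *msg = malloc(sizeof(IMsg) + size);  /* shared allocator in the real design */

    msg->next = NULL;
    msg->size = size;
    memcpy(msg->payload, data, size);
    return msg;
}

/* Enqueue: just a couple of pointer updates under the spinlock. */
void
msg_enqueue(MsgQueue *q, IMsg *msg)
{
    pthread_spin_lock(&q->lock);
    if (q->tail)
        q->tail->next = msg;
    else
        q->head = msg;
    q->tail = msg;
    pthread_spin_unlock(&q->lock);
}

/* Dequeue: likewise; the payload is read (and freed) outside the lock. */
IMsg *
msg_dequeue(MsgQueue *q)
{
    IMsg       *msg;

    pthread_spin_lock(&q->lock);
    msg = q->head;
    if (msg)
    {
        q->head = msg->next;
        if (q->head == NULL)
            q->tail = NULL;
    }
    pthread_spin_unlock(&q->lock);
    return msg;
}

And forwarding a message is then just a dequeue from one queue followed
by an enqueue onto another, without ever copying the payload.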


Re: bg worker: patch 1 of 6 - permanent process

From
Markus Wanner
Date:
(Sorry, need to disable Ctrl-Return, which quite often sends mails 
earlier than I really want.. continuing my mail)

On 08/27/2010 10:46 PM, Robert Haas wrote:
> Yeah, probably.  I think designing something that works efficiently
> over a network is a somewhat different problem than designing
> something that works on an individual node, and we probably shouldn't
> let the designs influence each other too much.

Agreed. Thus I've left out any kind of congestion avoidance stuff from 
imessages so far.

>>> There's no padding or sophisticated allocation needed.  You
>>> just need a pointer to the last byte read (P1), the last byte allowed
>>> to be read (P2), and the last byte allocated (P3).  Writers take a
>>> spinlock, advance P3, release the spinlock, write the message, take
>>> the spinlock, advance P2, release the spinlock, and signal the reader.
>>
>> That would block parallel writers (i.e. only one process can write to the
>> queue at any time).
>
> I feel like there's probably some variant of this idea that works
> around that problem.  The problem is that when a worker finishes
> writing a message, he needs to know whether to advance P2 only over
> his own message or also over some subsequent message that has been
> fully written in the meantime.  I don't know exactly how to solve that
> problem off the top of my head, but it seems like it might be
> possible.

I've tried pretty much that before. And failed. Because the 
allocation-order (i.e. the time the message gets created in preparation 
for writing to it) isn't necessarily the same as the sending-order (i.e. 
when the process has finished writing and decides to send the message).

To satisfy the FIFO property WRT the sending order, you need to decouple 
allocation from the ordering (i.e. queuing logic).

(And yes, it took me a while to figure out what was wrong in 
Postgres-R before I even noticed that design bug.)

>>> Readers take the spinlock, read P1 and P2, release the spinlock, read
>>> the data, take the spinlock, advance P1, and release the spinlock.
>>
>> It would require copying data in case a process only needs to forward the
>> message. That's a quick pointer dequeue and enqueue exercise ATM.
>
> If we need to do that, that's a compelling argument for having a
> single messaging area rather than one per backend.

Absolutely, yes.

> But I'm not sure I
> see why we would need that sort of capability.  Why wouldn't you just
> arrange for the sender to deliver the message directly to the final
> recipient?

A process can read and even change the data of the message before 
forwarding it. Something the coordinator in Postgres-R does sometimes. 
(As it is the interface to the GCS and thus to the rest of the nodes in 
the cluster).

For parallel querying (on a single node) that's probably less important 
a feature.

> So, they know in advance how large the message will be but not what
> the contents will be?  What are they doing?

Filling the message until it's (mostly) full and then continuing with the 
next one. At least that's how the streaming approach on top of imessages 
works.

But yes, it's somewhat annoying to have to know the message size in 
advance. I haven't implemented realloc so far. Nor can I think of any other 
solution. Note that separation of allocation and queue ordering is 
required anyway for the above reasons.

> Well, the fact that something is commonly used doesn't mean it's right
> for us.  Tabula rasa, we might design the whole system differently,
> but changing it now is not to be undertaken lightly.  Hopefully the
> above comments shed some light on my concerns.  In short, (1) I don't
> want to preallocate a big chunk of memory we might not use,

Isn't that exactly what we do now for lots of sub-systems, and what 
I'd like to improve (i.e. reduce to a single big chunk)?

> (2) I fear
> reducing the overall robustness of the system, and

Well, that applies to pretty much every new feature you add.

> (3) I'm uncertain
> what other systems would be able to leverage a dynamic allocator of the
> sort you propose.

Okay, that's up to me to show evidence (or at least a PoC).

Regards

Markus Wanner


Re: bg worker: patch 1 of 6 - permanent process

From
Tom Lane
Date:
Markus Wanner <markus@bluegap.ch> writes:
> AFAICT we currently have four fixed-size blocks to manage shared 
> buffers: the buffer blocks themselves, the buffer descriptors, the 
> strategy status (for the freelist) and the buffer lookup table.

> It's not obvious to me how these data structures should perform better 
> than a dynamically allocated layout.

Let me just point out that awhile back we got a *measurable* performance
boost by eliminating a single indirect fetch from the buffer addressing
code path.  We used to have an array of pointers pointing to the actual
buffers, and we removed that in favor of assuming the buffers were
laid out in a contiguous array, so that the address of buffer N could be
computed with a shift-and-add, eliminating the pointer fetch.  I forget
exactly what the numbers were, but it was significant enough to make us
change it.

So I don't have any faith in untested assertions that we can convert
these data structures to use dynamic allocation with no penalty.
It's very difficult to see how you'd do that without introducing a
new layer of indirection, and our experience shows that that layer
will cost you.
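
Roughly, and with made-up identifiers rather than the actual bufmgr
code, the difference was:

#include <stddef.h>

#define BLCKSZ 8192

/* Before: an array of pointers to separately allocated buffers. */
extern char **buffer_block_ptrs;

static inline char *
buffer_addr_indirect(int buf_id)
{
    return buffer_block_ptrs[buf_id];   /* costs an extra pointer fetch */
}

/* Now: one contiguous array; the address is computed directly. */
extern char *buffer_blocks;

static inline char *
buffer_addr_direct(int buf_id)
{
    return buffer_blocks + (size_t) buf_id * BLCKSZ;    /* shift-and-add */
}
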
        regards, tom lane


Re: bg worker: patch 1 of 6 - permanent process

From
Markus Wanner
Date:
Hi,

On 08/30/2010 04:52 PM, Tom Lane wrote:
> Let me just point out that awhile back we got a *measurable* performance
> boost by eliminating a single indirect fetch from the buffer addressing
> code path.

I'll take a look at that, thanks.

> So I don't have any faith in untested assertions

Neither do I. Thus I'm probably going to try my approach.

Regards

Markus Wanner


Re: bg worker: patch 1 of 6 - permanent process

From
Robert Haas
Date:
On Mon, Aug 30, 2010 at 11:30 AM, Markus Wanner <markus@bluegap.ch> wrote:
> On 08/30/2010 04:52 PM, Tom Lane wrote:
>> Let me just point out that awhile back we got a *measurable* performance
>> boost by eliminating a single indirect fetch from the buffer addressing
>> code path.
>
> I'll take a look at that, thanks.
>
>> So I don't have any faith in untested assertions
> Neither do I. Thus I'm probably going to try my approach.

As a matter of project management, I am inclined to think that until
we've hammered out this issue, there's not a whole lot useful that can
be done on any of the BG worker patches.  So I am wondering if we
should set those to Returned with Feedback or bump them to a future
CommitFest.

The good news is that, after a lot of back and forth, I think we've
identified the reason underpinning much of why Markus and I have been
disagreeing about dynshmem and imessages - namely, whether or not it's
possible to allocate shared_buffers as something other than one giant
slab without taking an unacceptable performance hit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: bg worker: patch 1 of 6 - permanent process

From
Markus Wanner
Date:
On 09/14/2010 06:26 PM, Robert Haas wrote:
> As a matter of project management, I am inclined to think that until
> we've hammered out this issue, there's not a whole lot useful that can
> be done on any of the BG worker patches.  So I am wondering if we
> should set those to Returned with Feedback or bump them to a future
> CommitFest.

I agree in general. I certainly don't want to hold back the commit fest.

What bugs me a bit is that I didn't really get much feedback regarding 
the *bgworker* portion of code. Especially as that's the part I'm most 
interested in getting feedback on.

However, I currently don't have any time to work on these patches, so 
I'm fine with dropping them from the current commit fest.

> The good news is that, after a lot of back and forth, I think we've
> identified the reason underpinning much of why Markus and I have been
> disagreeing about dynshmem and imessages - namely, whether or not it's
> possible to allocate shared_buffers as something other than one giant
> slab without taking an unacceptable performance hit.

Agreed.

Regards

Markus Wanner


Re: bg worker: patch 1 of 6 - permanent process

From
Alvaro Herrera
Date:
Excerpts from Markus Wanner's message of mar sep 14 12:56:59 -0400 2010:

> What bugs me a bit is that I didn't really get much feedback regarding 
> the *bgworker* portion of code. Especially as that's the part I'm most 
> interested in getting feedback on.

I think we've had enough problems with the current design of forking a
new autovac process every once in a while, that I'd like to have them as
permanent processes instead, waiting for orders from the autovac
launcher.  From that POV, bgworkers would make sense.

I cannot promise a timely review however :-(

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: bg worker: patch 1 of 6 - permanent process

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> I think we've had enough problems with the current design of forking a
> new autovac process every once in a while, that I'd like to have them as
> permanent processes instead, waiting for orders from the autovac
> launcher.  From that POV, bgworkers would make sense.

That seems like a fairly large can of worms to open: we have never tried
to make backends switch from one database to another, and I don't think
I'd want to start such a project with autovac.
        regards, tom lane


Re: bg worker: patch 1 of 6 - permanent process

From
Alvaro Herrera
Date:
Excerpts from Tom Lane's message of mar sep 14 13:46:17 -0400 2010:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > I think we've had enough problems with the current design of forking a
> > new autovac process every once in a while, that I'd like to have them as
> > permanent processes instead, waiting for orders from the autovac
> > launcher.  From that POV, bgworkers would make sense.
> 
> That seems like a fairly large can of worms to open: we have never tried
> to make backends switch from one database to another, and I don't think
> I'd want to start such a project with autovac.

Yeah, what I was thinking is that each worker would still die after
completing the run, but a new one would be started immediately; it would
go to sleep until a new assignment arrived.  (What got me into this was
the whole latch thing, actually.)

This is a very raw idea however, so don't mind me much.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: bg worker: patch 1 of 6 - permanent process

From
Tom Lane
Date:
Markus Wanner <markus@bluegap.ch> writes:
> On 09/14/2010 07:46 PM, Tom Lane wrote:
>> That seems like a fairly large can of worms to open: we have never tried
>> to make backends switch from one database to another, and I don't think
>> I'd want to start such a project with autovac.

> They don't. Even with bgworker, every backend stays connected to the 
> same database. You configure the min and max amounts of idle backends 
> *per database*. Plus the overall max of background workers, IIRC.

So there is a minimum of one avworker per database?  That's a guaranteed
nonstarter.  There are many people with thousands of databases, but no
need for thousands of avworkers.

I'm also pretty unclear why you speak of min and max numbers of workers
when the proposal (AIUI) is to have the workers there always, rather
than have them come and go.
        regards, tom lane


Re: bg worker: patch 1 of 6 - permanent process

From
Markus Wanner
Date:
Hi,

On 09/14/2010 07:46 PM, Tom Lane wrote:
> Alvaro Herrera<alvherre@commandprompt.com>  writes:
>> I think we've had enough problems with the current design of forking a
>> new autovac process every once in a while, that I'd like to have them as
>> permanent processes instead, waiting for orders from the autovac
>> launcher.  From that POV, bgworkers would make sense.

Okay, great.

> That seems like a fairly large can of worms to open: we have never tried
> to make backends switch from one database to another, and I don't think
> I'd want to start such a project with autovac.

They don't. Even with bgworker, every backend stays connected to the 
same database. You configure the min and max amounts of idle backends 
*per database*. Plus the overall max of background workers, IIRC.

Regards

Markus Wanner


Re: bg worker: patch 1 of 6 - permanent process

From
Markus Wanner
Date:
Hi,

I'm glad discussion on this begins.

On 09/14/2010 07:55 PM, Tom Lane wrote:
> So there is a minimum of one avworker per database?

Nope, you can set that to 0. You don't *need* to keep idle workers around.

> That's a guaranteed
> nonstarter.  There are many people with thousands of databases, but no
> need for thousands of avworkers.

Well, yeah, bgworkers are primarily designed to be used for Postgres-R, 
where you easily get more background workers than normal backends. And 
having idle backends around waiting for a next job is certainly 
preferable over having to re-connect every time.

I've advertised the bgworker infrastructure for use for parallel 
querying as well. Again, that use case easily leads to having more 
background workers than normal backends. And you don't want to wait for 
them all to re-connect for every query they need to help with.

> I'm also pretty unclear why you speak of min and max numbers of workers
> when the proposal (AIUI) is to have the workers there always, rather
> than have them come and go.

This step 1 of the bgworker set of patches turns the av*launcher* 
(coordinator) into a permanent process (even if autovacuum is off).

The background workers can still come and go. However, they don't 
necessarily *need* to terminate after having done their job. The 
coordinator controls them and requests new workers or commands idle ones 
to terminate *as required*.

I don't think there's that much different to the current implementation. 
Setting both the min and max number of idle bgworkers to 0 should in 
fact give you the exact same behavior as we currently have: namely to 
terminate each av/bgworker after its job is done, never having idle 
workers around. Which might or might not be the optimal configuration 
for users with lots of databases, that's hard to predict. And it depends 
a lot on the load distribution over the databases and on how clever the 
coordinator manages the bgworkers.

Regards

Markus Wanner


Re: bg worker: patch 1 of 6 - permanent process

From
Markus Wanner
Date:
On 09/14/2010 08:06 PM, Robert Haas wrote:
> One idea I had was to have autovacuum workers stick around for a
> period of time after finishing their work.  When we need to autovacuum
> a database, first check whether there's an existing worker that we can
> use, and if so use him.  If not, start a new one.  If that puts us
> over the max number of workers, kill off the one that's been waiting
> the longest.  But workers will exit anyway if not reused after a
> certain period of time.

That's pretty close to how bgworkers are implemented, now. Except for 
the need to terminate after a certain period of time. What is that 
intended to be good for?

Especially considering that the avlauncher/coordinator knows the current 
amount of work (number of jobs) per database.

> The idea here would be to try to avoid all the backend startup costs:
> process creation, priming the caches, etc.  But I'm not really sure
> it's worth the effort.  I think we need to look for ways to further
> reduce the overhead of vacuuming, but this doesn't necessarily seem
> like the thing that would have the most bang for the buck.

Well, the pressure has simply been bigger for Postgres-R.

It should be possible to do benchmarks using Postgres-R and compare 
against a max_idle_background_workers = 0 configuration that leads to 
termination and re-connecting for every remote transaction to be applied. 
However, that's not going to say anything about whether or not it's 
worth it for autovacuum.

Regards

Markus Wanner


Re: bg worker: patch 1 of 6 - permanent process

From
Robert Haas
Date:
On Tue, Sep 14, 2010 at 1:56 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Excerpts from Tom Lane's message of mar sep 14 13:46:17 -0400 2010:
>> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> > I think we've had enough problems with the current design of forking a
>> > new autovac process every once in a while, that I'd like to have them as
>> > permanent processes instead, waiting for orders from the autovac
>> > launcher.  From that POV, bgworkers would make sense.
>>
>> That seems like a fairly large can of worms to open: we have never tried
>> to make backends switch from one database to another, and I don't think
>> I'd want to start such a project with autovac.
>
> Yeah, what I was thinking is that each worker would still die after
> completing the run, but a new one would be started immediately; it would
> go to sleep until a new assignment arrived.  (What got me into this was
> the whole latch thing, actually.)
>
> This is a very raw idea however, so don't mind me much.

What would be the advantage of that?

One idea I had was to have autovacuum workers stick around for a
period of time after finishing their work.  When we need to autovacuum
a database, first check whether there's an existing worker that we can
use, and if so use him.  If not, start a new one.  If that puts us
over the max number of workers, kill off the one that's been waiting
the longest.  But workers will exit anyway if not reused after a
certain period of time.
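
Very roughly, in C, the dispatch side of that might look like the sketch
below; the fixed-size pool and the helper functions are assumptions made
up for the example, not anything that exists today:

#include <stdbool.h>
#include <time.h>

#define MAX_WORKERS 8                   /* illustrative cap */

typedef struct Worker
{
    int         pid;                    /* 0 = unused slot */
    unsigned    dboid;                  /* database this worker is bound to */
    bool        idle;
    time_t      idle_since;
} Worker;

static Worker pool[MAX_WORKERS];

/* Assumed helpers, declared only so the sketch is self-contained. */
extern void assign_job(Worker *w);
extern void stop_worker(Worker *w);
extern void start_worker(unsigned dboid);

/*
 * Hand a vacuum job for 'dboid' to a worker: reuse an idle worker that is
 * already bound to that database if there is one, otherwise start a new
 * one, evicting the longest-idle worker if the pool is full.
 */
void
dispatch_job(unsigned dboid)
{
    Worker     *victim = NULL;
    int         used = 0;

    for (int i = 0; i < MAX_WORKERS; i++)
    {
        Worker     *w = &pool[i];

        if (w->pid == 0)
            continue;
        used++;
        if (w->idle && w->dboid == dboid)
        {
            assign_job(w);              /* reuse: no fork, no re-connect */
            return;
        }
        if (w->idle && (victim == NULL || w->idle_since < victim->idle_since))
            victim = w;
    }

    if (used >= MAX_WORKERS)
    {
        if (victim == NULL)
            return;                     /* everyone is busy; try again later */
        stop_worker(victim);            /* kill off the longest-idle worker */
    }

    start_worker(dboid);                /* fork a new worker for this database */
}

The "exit anyway if not reused for a while" part would just be the worker
itself checking its own idle time; it's left out here.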

The idea here would be to try to avoid all the backend startup costs:
process creation, priming the caches, etc.  But I'm not really sure
it's worth the effort.  I think we need to look for ways to further
reduce the overhead of vacuuming, but this doesn't necessarily seem
like the thing that would have the most bang for the buck.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: bg worker: patch 1 of 6 - permanent process

From
Robert Haas
Date:
On Tue, Sep 14, 2010 at 2:26 PM, Markus Wanner <markus@bluegap.ch> wrote:
> On 09/14/2010 08:06 PM, Robert Haas wrote:
>> One idea I had was to have autovacuum workers stick around for a
>> period of time after finishing their work.  When we need to autovacuum
>> a database, first check whether there's an existing worker that we can
>> use, and if so use him.  If not, start a new one.  If that puts us
>> over the max number of workers, kill off the one that's been waiting
>> the longest.  But workers will exit anyway if not reused after a
>> certain period of time.
>
> That's pretty close to how bgworkers are implemented, now. Except for the
> need to terminate after a certain period of time. What is that intended to
> be good for?

To avoid consuming system resources forever if they're not being used.

> Especially considering that the avlauncher/coordinator knows the current
> amount of work (number of jobs) per database.
>
>> The idea here would be to try to avoid all the backend startup costs:
>> process creation, priming the caches, etc.  But I'm not really sure
>> it's worth the effort.  I think we need to look for ways to further
>> reduce the overhead of vacuuming, but this doesn't necessarily seem
>> like the thing that would have the most bang for the buck.
>
> Well, the pressure has simply been bigger for Postgres-R.
>
> It should be possible to do benchmarks using Postgres-R and compare against
> a max_idle_background_workers = 0 configuration that leads to termination
> and re-connecting for every remote transaction to be applied.

Well, presumably that would be fairly disastrous.  I would think,
though, that you would not have a min/max number of workers PER
DATABASE, but an overall limit on the upper size of the total pool - I
can't see any reason to limit the minimum size of the pool, but I
might be missing something.

> However, that's
> not going to say anything about whether or not it's worth it for autovacuum.

Personally, my position is that if someone does something that is only
a small improvement on its own but which has the potential to help
with other things later, that's a perfectly legitimate patch and we
should try to accept it.  But if it's not a clear (even if small) win
then the bar is a lot higher, at least in my book.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: bg worker: patch 1 of 6 - permanent process

From
Markus Wanner
Date:
On 09/14/2010 08:41 PM, Robert Haas wrote:
> To avoid consuming system resources forever if they're not being used.

Well, what timeout would you choose? And how would you justify it 
compared to the amounts of system resources consumed by an idle process 
sitting there and waiting for a job?

I'm not against such a timeout, but so far I felt that unlimited would 
be the best default.

> Well, presumably that would be fairly disastrous.  I would think,
> though, that you would not have a min/max number of workers PER
> DATABASE, but an overall limit on the upper size of the total pool

That already exists (in addition to the other parameters).

> - I
> can't see any reason to limit the minimum size of the pool, but I
> might be missing something.

I tried to mimic what others do, for example apache pre-fork. Maybe it's 
just another way of trying to keep the overall resource consumption at a 
reasonable level.

The minimum is helpful to eliminate waits for backends starting up. Note 
here that the coordinator can only request to fork one new bgworker at a 
time. It then needs to wait until that new bgworker registers with the 
coordinator, before it can request further bgworkers from the postmaster. 
(That's due to the limitation in communication between the postmaster 
and coordinator).

> Personally, my position is that if someone does something that is only
> a small improvement on its own but which has the potential to help
> with other things later, that's a perfectly legitimate patch and we
> should try to accept it.  But if it's not a clear (even if small) win
> then the bar is a lot higher, at least in my book.

I don't think it's an improvement over the current autovacuum behavior. 
Not intended to be one. But it certainly shouldn't hurt, either.

It only has the potential to help with other things, namely parallel 
querying. And of course replication (Postgres-R). Or any other kind of 
background job you come up with (where background means not requiring a 
client connection).

Regards

Markus Wanner


Re: bg worker: patch 1 of 6 - permanent process

From
Robert Haas
Date:
On Tue, Sep 14, 2010 at 2:59 PM, Markus Wanner <markus@bluegap.ch> wrote:
> On 09/14/2010 08:41 PM, Robert Haas wrote:
>>
>> To avoid consuming system resources forever if they're not being used.
>
> Well, what timeout would you choose? And how would you justify it compared
> to the amounts of system resources consumed by an idle process sitting there
> and waiting for a job?
>
> I'm not against such a timeout, but so far I felt that unlimited would be
> the best default.

I don't have a specific number in mind.  5 minutes?

>> Well, presumably that would be fairly disastrous.  I would think,
>> though, that you would not have a min/max number of workers PER
>> DATABASE, but an overall limit on the upper size of the total pool
>
> That already exists (in addition to the other parameters).

Hmm.  So what happens if you have 1000 databases with a minimum of 1
worker per database and an overall limit of 10 workers?

>> - I
>> can't see any reason to limit the minimum size of the pool, but I
>> might be missing something.
>
> I tried to mimic what others do, for example apache pre-fork. Maybe it's
> just another way of trying to keep the overall resource consumption at a
> reasonable level.
>
> The minimum is helpful to eliminate waits for backends starting up. Note
> here that the coordinator can only request to fork one new bgworker at a
> time. It then needs to wait until that new bgworker registers with the
> coordinator, before it can request further bgworkers from the postmaster.
> (That's due to the limitation in communication between the postmaster and
> coordinator).

Hmm, I see.  That's probably not helpful for autovacuum, but I can see
it being useful for replication.  I still think maybe we ought to try
to crack the nut of allowing backends to rebind to a different
database.  That would simplify things here a good deal, although then
again maybe it's too complex to be worth it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: bg worker: patch 1 of 6 - permanent process

From
Markus Wanner
Date:
Hi,

On 09/15/2010 03:44 AM, Robert Haas wrote:
> Hmm.  So what happens if you have 1000 databases with a minimum of 1
> worker per database and an overall limit of 10 workers?

The first 10 databases would get an idle worker. As soon as real jobs 
arrive, the idle workers on databases that don't have any pending jobs 
get terminated in favor of the databases for which there are pending 
jobs. Admittedly, that mechanism isn't too clever, yet. I.e. if there 
always are enough jobs for one database, the others could starve.

With 1000 databases and a max of only 10 workers, chances for having a 
spare worker for the database that gets the next job are pretty low, 
yes. But that's the case with the proposed 5 minute timeout as well.

Lowering that timeout wouldn't increase the chance. And while it might 
make the start of a new bgworker quicker in the above mentioned case, I 
think there's not much advantage over just setting 
max_idle_background_workers = 0.

OTOH such a timeout would be easy enough to implement. The admin would 
be faced with yet another GUC, though.

> Hmm, I see.  That's probably not helpful for autovacuum, but I can see
> it being useful for replication.

Glad to hear.

> I still think maybe we ought to try
> to crack the nut of allowing backends to rebind to a different
> database.  That would simplify things here a good deal, although then
> again maybe it's too complex to be worth it.

Also note that it would re-introduce some of the costs we try to avoid 
with keeping the connected bgworker around. And if you can afford to have 
at least a few spare bgworkers around per database (i.e. fewer than 10 or 
20 databases), the potential savings seem to be negligible again.

Regards

Markus Wanner


Re: bg worker: patch 1 of 6 - permanent process

From
Robert Haas
Date:
On Wed, Sep 15, 2010 at 2:48 AM, Markus Wanner <markus@bluegap.ch> wrote:
>> Hmm.  So what happens if you have 1000 databases with a minimum of 1
>> worker per database and an overall limit of 10 workers?
>
> The first 10 databases would get an idle worker. As soon as real jobs
> arrive, the idle workers on databases that don't have any pending jobs get
> terminated in favor of the databases for which there are pending jobs.
> Admittedly, that mechanism isn't too clever, yet. I.e. if there always are
> enough jobs for one database, the others could starve.

I haven't scrutinized your code but it seems like the
minimum-per-database might be complicating things more than necessary.
You might find that you can make the logic simpler without that.  I
might be wrong, though.

I guess the real issue here is whether it's possible to, and whether
you're interested in, extracting a committable subset of this work,
and if so what that subset should look like.  There's sort of a
chicken-and-egg problem with large patches; if you present them as one
giant monolithic patch, they're too large to review.  But if you break
them down into smaller patches, it doesn't really fix the problem
unless the pieces have independent value.  Even in the two years I've
been involved in the project, a number of different contributors have
gone through the experience of submitting a patch that only made sense
if you assumed that the follow-on patch was also going to get
accepted, and as no one was willing to assume that, the first patch
didn't get committed either.  Where people have been able to break
things down into a series of small to medium-sized incremental
improvements, things have gone more smoothly.  For example, Simon was
able to get a batch to start the bgwriter during archive recovery
committed to 8.4.  That didn't have a lot of independent value, but it
had some, and it paved the way for Hot Standby in 9.0.  Had someone
thought of a way to decompose that project into more than two truly
independent pieces, I suspect it might have even gone more smoothly
(although of course that's an arguable point and YMMV).

>> I still think maybe we ought to try
>> to crack the nut of allowing backends to rebind to a different
>> database.  That would simplify things here a good deal, although then
>> again maybe it's too complex to be worth it.
>
> Also note that it would re-introduce some of the costs we try to avoid with
> keeping the connected bgworker around.

How?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: bg worker: patch 1 of 6 - permanent process

From
Markus Wanner
Date:
Robert,

On 09/15/2010 07:23 PM, Robert Haas wrote:
> I haven't scrutinized your code but it seems like the
> minimum-per-database might be complicating things more than necessary.
>   You might find that you can make the logic simpler without that.  I
> might be wrong, though.

I still think of that as a feature, not something I'd like to simplify 
away. Are you arguing it could raise the chance of adoption of bgworkers 
into Postgres? I'm not seeing your point here.

> I guess the real issue here is whether it's possible to, and whether
> you're interested in, extracting a committable subset of this work,
> and if so what that subset should look like.

Well, as it doesn't currently provide any real benefit for autovacuum, 
it depends on how much hackers like to prepare for something like 
Postgres-R or parallel querying.

> There's sort of a
> chicken-and-egg problem with large patches; if you present them as one
> giant monolithic patch, they're too large to review.  But if you break
> them down into smaller patches, it doesn't really fix the problem
> unless the pieces have independent value.

I don't quite get what you are trying to say here. I split the 
bgworker project from Postgres-R into 6 separate patches. Are you saying 
that's too few or too many?

You are welcome to argue about independent patches, i.e. this patch 1 of 
6 (as $SUBJECT implies) might have some value, according to Alvaro.

Admittedly, patch 2 of 6 is the biggest and most important chunk of 
functionality of the whole set.

>> Also note that it would re-introduce some of the costs we try to avoid with
>> keeping the connected bgworker around.
>
> How?

I'm talking about the cost of connecting to a database (and 
disconnecting), most probably flushing caches, and very probably some 
kind of re-registering with the coordinator. Most of what a normal 
backend does at startup. About the only thing you'd save here is the 
fork() and very basic process setup. I really doubt that's worth the effort.

Having some more idle processes around doesn't hurt that much and solves 
the problem, I think.

Thanks for your feedback.

Regards

Markus Wanner


Re: bg worker: patch 1 of 6 - permanent process

From
Robert Haas
Date:
On Wed, Sep 15, 2010 at 2:28 PM, Markus Wanner <markus@bluegap.ch> wrote:
>> I guess the real issue here is whether it's possible to, and whether
>> you're interested in, extracting a committable subset of this work,
>> and if so what that subset should look like.
>
> Well, as it doesn't currently provide any real benefit for autovacuum, it
> depends on how much hackers like to prepare for something like Postgres-R or
> parallel querying.

I think that the bar for committing to another in-core replication
solution right now is probably fairly high.  I am pretty doubtful that
our current architecture is going to get us to the full feature set
we'd eventually like to have - multi-master, partial replication, etc.But we're not ever going to have ten replication
solutionsin core, 
so we need to think pretty carefully about what we accept.  That
conversation probably needs to start from the other end - is the
overall architecture correct for us? - before we get down to specific
patches.  On the other hand, I'm very interested in laying the
groundwork for parallel query, and I think there are probably a number
of bits of architecture both from this project and Postgres-XC, that
could be valuable contributions to PostgreSQL; however, in neither
case do I expect them to be accepted without significant modification.

>> There's sort of a
>> chicken-and-egg problem with large patches; if you present them as one
>> giant monolithic patch, they're too large to review.  But if you break
>> them down into smaller patches, it doesn't really fix the problem
>> unless the pieces have independent value.
>
> I don't quite get what you are trying to say here. I split the bgworker
> project from Postgres-R into 6 separate patches. Are you saying that's too
> few or too many?

I'm saying it's hard to think about committing any of them because
they aren't really independent of each other or of other parts of
Postgres-R.

I feel like there is an antagonistic thread to this conversation, and
some others that we've had.  I hope I'm misreading that, because it's
not my intent to piss you off.  I'm just offering my honest feedback.
Your mileage may vary; others may feel differently; none of it is
personal.

>>> Also note that it would re-introduce some of the costs we try to avoid
>>> with
>>> keeping the connected bgworker around.
>>
>> How?
>
> I'm talking about the cost of connecting to a database (and disconnecting),
> most probably flushing caches, and very probably some kind of re-registering
> with the coordinator. Most of what a normal backend does at startup. About
> the only thing you'd save here is the fork() and very basic process setup. I
> really doubt that's worth the effort.
>
> Having some more idle processes around doesn't hurt that much and solves the
> problem, I think.

OK, I think I understand what you're trying to say now.  I guess I
feel like the ideal architecture for any sort of solution that needs a
pool of workers would be to keep around the workers that most recently
proved to be useful.  Upon needing a new worker, you look for one
that's available and already bound to the correct database.  If you
find one, you assign him to the new task.  If not, you find the one
that's been idle longest and either (a) kill him off and start a new
one that is bound to the correct database or, even better, (b) tell
him to flush his caches and rebind to the correct database.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: bg worker: patch 1 of 6 - permanent process

From
Markus Wanner
Date:
Hi,

On 09/15/2010 08:54 PM, Robert Haas wrote:
> I think that the bar for committing to another in-core replication
> solution right now is probably fairly high.

I'm not trying to convince you to accept the Postgres-R patch.. at least 
not now.

<showing-off>
BTW, that'd be what I call a huge patch:

bgworkers, excluding dynshmem and imessages: 34 files changed, 2910 insertions(+), 1421 deletions(-)

from there to Postgres-R: 98 files changed, 14856 insertions(+), 230 deletions(-)
</showing-off>

> I am pretty doubtful that
> our current architecture is going to get us to the full feature set
> we'd eventually like to have - multi-master, partial replication, etc.

Would be hard to do, due to the (physical) format of WAL, yes. That's 
why Postgres-R uses its own (logical) wire format.

>   But we're not ever going to have ten replication solutions in core,
> so we need to think pretty carefully about what we accept.

That's very understandable.

> That
> conversation probably needs to start from the other end - is the
> overall architecture correct for us? - before we get down to specific
> patches.  On the other hand, I'm very interested in laying the
> groundwork for parallel query

Cool. Maybe we should take another look at bgworkers, as soon as a 
parallel querying feature gets planned?

> and I think there are probably a number
> of bits of architecture both from this project and Postgres-XC, that
> could be valuable contributions to PostgreSQL;

(...note that Postgres-R is license compatible, as opposed to the GPL'ed 
Postgres-XC project...)

> however, in neither
> case do I expect them to be accepted without significant modification.

Sure, that's understandable as well. I've published this part of the 
infrastructure to get some feedback as early as possible on that part of 
Postgres-R.

As you can certainly imagine, it's important for me that any 
modification to such a patch from Postgres-R would still be compatible 
with what I use it for in Postgres-R and not cripple any functionality 
there, because that'd probably create more work for me than not getting 
the patch accepted upstream at all.

> I'm saying it's hard to think about committing any of them because
> they aren't really independent of each other or of other parts of
> Postgres-R.

As long as you don't consider imessages and dynshmem a part of 
Postgres-R, they are independent of the rest of Postgres-R in the 
technical sense.

And for any kind of parallel querying feature, imessages and dynshmem 
might be of help as well. So I currently don't see where I could 
de-couple these patches any further.

If you have a specific requirement, please don't hesitate to ask.

> I feel like there is an antagonistic thread to this conversation, and
> some others that we've had.  I hope I'm misreading that, because it's
> not my intent to piss you off.  I'm just offering my honest feedback.
> Your mileage may vary; others may feel differently; none of it is
> personal.

That's absolutely fine. I'm thankful for your feedback.

Also note that I initially didn't even want to add the bgworker patches 
to the commit fest. I've de-coupled and published these separate from 
Postgres-R with a) the hope to get feedback (more than for the overall 
Postgres-R patch) and b) to show others that such a facility exists and 
is ready to be reused.

I didn't really expect them to get accepted to Postgres core at the 
moment. But the Postgres team normally asks for sharing concepts and 
ideas as early as possible...

> OK, I think I understand what you're trying to say now.  I guess I
> feel like the ideal architecture for any sort of solution that needs a
> pool of workers would be to keep around the workers that most recently
> proved to be useful.  Upon needing a new worker, you look for one
> that's available and already bound to the correct database.  If you
> find one, you assign him to the new task.

That's mostly how bgworkers are designed, yes. The min/max idle 
background worker GUCs allow a loose control over how many spare 
processes you want to allow hanging around doing nothing.

> If not, you find the one
> that's been idle longest and either (a) kill him off and start a new
> one that is bound to the correct database or, even better, (b) tell
> him to flush his caches and rebind to the correct database.

Hm.. sorry if I didn't express this more clearly. What I'm trying to say 
is that (b) isn't worth implementing, because it doesn't offer enough of 
an improvement over (a). The only saving would be the fork() and some 
basic process initialization.

Being able to re-use a bgworker connected to the correct database 
already gives you most of the benefit, namely not having to fork() *and* 
re-connect to the database for every job.


Back at the technical issues, let me try to summarize the feedback and 
what I do with it.

In general, there's not much use for bgworkers for just autovacuum as 
the only background job. I agree.

Tom raised the 'lots of databases' issue. I agree that the bgworker 
infrastructure isn't optimized for such a workload, but argue that it can 
be configured so it doesn't hurt. If bgworkers ever gets accepted upstream, 
we'd certainly need to discuss reasonable defaults for the relevant 
GUCs. Additionally, more cleverness about when to start or stop (spare) 
workers from the coordinator couldn't hurt.

I had a lengthy discussion with Dimitri about whether or not bgworkers 
could help him with some kind of PgQ daemon. I think we now agree that 
bgworkers isn't the right tool for that job.

You are questioning whether the min_idle_bgworkers GUC is really 
necessary. I'm arguing that it is necessary in Postgres-R to cover load 
spikes, because starting bgworkers is slow.


So, overall, I now got quite a bit of feedback. There doesn't seem to be 
any stumbling block in the general design of bgworkers. So I'll happily 
continue to use (and refine) bgworkers for Postgres-R. And I'm looking 
forward to more discussions once parallel querying gets more serious 
attention.

Regards

Markus Wanner


Re: bg worker: patch 1 of 6 - permanent process

From
Robert Haas
Date:
On Thu, Sep 16, 2010 at 4:47 AM, Markus Wanner <markus@bluegap.ch> wrote:
> <showing-off>
> BTW, that'd be what I call a huge patch:
>
> bgworkers, excluding dynshmem and imessages:
>  34 files changed, 2910 insertions(+), 1421 deletions(-)
>
> from there to Postgres-R:
>  98 files changed, 14856 insertions(+), 230 deletions(-)
> </showing-off>

Yeah, that's huge.  :-)

>> That
>> conversation probably needs to start from the other end - is the
>> overall architecture correct for us? - before we get down to specific
>> patches.  On the other hand, I'm very interested in laying the
>> groundwork for parallel query
>
> Cool. Maybe we should take another look at bgworkers, as soon as a parallel
> querying feature gets planned?

Well, that will obviously depend somewhat on the wishes of whoever
decides to work on parallel query, but it seems reasonable to me.  It
would be nice to get some pieces of this committed incrementally but
as I say I fear there is too much dependency on what might happen
later, at least the way things are structured now.

>> and I think there are probably a number
>> of bits of architecture both from this project and Postgres-XC, that
>> could be valuable contributions to PostgreSQL;
>
> (...note that Postgres-R is license compatible, as opposed to the GPL'ed
> Postgres-XC project...)

Yeah.  +1 for license compatibility.

> As you can certainly imagine, it's important for me that any modification to
> such a patch from Postgres-R would still be compatible with what I use it for
> in Postgres-R and not cripple any functionality there, because that'd
> probably create more work for me than not getting the patch accepted
> upstream at all.

That's an understandable goal, but it may be difficult to achieve.

> As long as you don't consider imessages and dynshmem a part of Postgres-R,
> they are independent of the rest of Postgres-R in the technical sense.
>
> And for any kind of parallel querying feature, imessages and dynshmem might
> be of help as well. So I currently don't see where I could de-couple these
> patches any further.

I agree.  I've already said my piece on how I think that stuff would
need to be reworked to be acceptable, so we might have to agree to
disagree on those, especially if your goal is to get something
committed that doesn't involve a major rewrite on your end.

> I didn't really expect them to get accepted to Postgres core at the moment.
> But the Postgres team normally asks for sharing concepts and ideas as early
> as possible...

Absolutely.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: bg worker: patch 1 of 6 - permanent process

From
Markus Wanner
Date:
Morning,

On 09/16/2010 04:26 PM, Robert Haas wrote:
> I agree.  I've already said my piece on how I think that stuff would
> need to be reworked to be acceptable, so we might have to agree to
> disagree on those, especially if your goal is to get something
> committed that doesn't involve a major rewrite on your end.

Just for clarification: you are referring to imessages and dynshmem 
here, right? I agree that dynshmem needs to be reworked and rethought. 
And imessages simply depends on dynshmem.

If you are referring to the bgworker stuff, I'm not quite clear about 
what I could do to make bgworker more acceptable. (Except perhaps for 
removing the dependency on imessages).

Regards

Markus Wanner


Re: bg worker: patch 1 of 6 - permanent process

From
Robert Haas
Date:
On Thu, Sep 16, 2010 at 1:20 PM, Markus Wanner <markus@bluegap.ch> wrote:
> On 09/16/2010 04:26 PM, Robert Haas wrote:
>>
>> I agree.  I've already said my piece on how I think that stuff would
>> need to be reworked to be acceptable, so we might have to agree to
>> disagree on those, especially if your goal is to get something
>> committed that doesn't involve a major rewrite on your end.
>
> Just for clarification: you are referring to imessages and dynshmem here,
> right? I agree that dynshmem needs to be reworked and rethought. And
> imessages simply depends on dynshmem.

Yes, I was referring to imessages and dynshmem.

> If you are referring to the bgworker stuff, I'm not quite clear about what I
> could do to make bgworker more acceptable. (Except perhaps for removing the
> dependency on imessages).

I'm not sure, either.  It would be nice if there were a way to create
a general facility here that we could then build various applications
on, but I'm not sure whether that's the case.  We had some
back-and-forth about what is best for replication vs. what is best for
vacuum vs. what is best for parallel query.  If we could somehow
conceive of a system that could serve all of those needs without
introducing any more configuration complexity than what we have now,
that would of course be very interesting.  But judging by your
comments I'm not very sure such a thing is feasible, so perhaps
wait-and-see is the best approach.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: bg worker: general purpose requirements

From
Markus Wanner
Date:
Hi,

On 09/16/2010 07:47 PM, Robert Haas wrote:
> It would be nice if there were a way to create
> a general facility here that we could then build various applications
> on, but I'm not sure whether that's the case.  We had some
> back-and-forth about what is best for replication vs. what is best for
> vacuum vs. what is best for parallel query.  If we could somehow
> conceive of a system that could serve all of those needs without
> introducing any more configuration complexity than what we have now,
> that would of course be very interesting.

Let's think about this again from a little distance. We have the existing 
autovacuum and the Postgres-R project. Then there are the potential 
features 'parallel querying' and 'autonomous transactions' that could in 
principle benefit from the bgworker infrastructure.

For all of those, one could head for a multi-threaded, a multi-process 
or an async, event based approach. Multi-threading seems to be out of 
question for Postgres. We don't have much of an async event framework 
anywhere, so at least for parallel querying that seems out of question 
as well. Only the 'autonomous transactions' feature seems simple enough 
to be doable within a single process. That approach would still miss the 
isolation that a separate process features (not sure that's required, 
but 'autonomous' sounds like it could be a good thing to have).

So assuming we use the multi-process approach provided by bgworkers for 
both potential features. What are the requirements?

autovacuum: only very few jobs at a time, not very resource intensive, 
not passing around lots of data

Postgres-R: lots of concurrent jobs, easily more than normal backends, 
depending on the amount of nodes in the cluster and read/write ratio, 
lots of data to be passed around

parallel querying: a couple dozen concurrent jobs (by number of CPUs or 
spindles available?), more doesn't help, lots of data to be passed around

autonomous transactions: max. one per normal backend (correct?), way 
fewer should suffice in most cases, only control data to be passed around


So, for both potential features as well as for autovacuum, a ratio of 
1:10 (or even less) for max_bgworkers:max_connections would suffice. 
Postgres-R clearly seems to be the outlier here. It needs special 
configuration anyway, so I'd have no problem with defaults that target 
the other use cases.

All of the potential users of bgworkers benefit from a pre-connected 
bgworker. Meaning having at least one spare bgworker around per database 
could be beneficial, potentially more depending on how often spike loads 
occur. As long as there are only few databases, it's easily possible to 
have at least one spare process around per database, but with thousands 
of databases, that might get prohibitively expensive (not sure where the 
boundary between win vs loose is, though. Idle backends vs. connection 
cost).

None the less, bgworkers would make the above features easier to 
implement, as they provide the controlled background worker process 
infrastructure, including job handling (and even queuing) in the 
coordinator process. Having spare workers available is not a prerequisite 
to use bgworkers, it's just an optimization.

Autovacuum could possibly benefit from bgworkers by enabling a finer 
grained choice for what database and table to vacuum when. I didn't look 
too much into that, though.

Regarding the additional configuration overhead of the bgworkers patch: 
max_autovacuum_workers gets turned into max_background_workers, so the 
only additional GUCs currently are: min_spare_background_workers and 
max_spare_background_workers (sorry, I thought I named them idle 
workers, looks like I've gone with spare workers for the GUCs).

Those are used to control and limit (in both directions) the amount of 
spare workers (per database). It's the simplest possible variant I could 
think of. But I'm open to other mechanisms, especially ones that require 
less configuration. Simply keeping spare workers around for a given 
timeout *could* be a replacement and would save us one GUC.

However, I feel like this gives less control over how the bgworkers are 
used. For example, I'd prefer to be able to prevent the system from 
allocating all bgworkers to a single database at once. And as mentioned 
above, it also makes sense to pre-fork some bgworkers in advance, if 
there are still enough available. The timeout approach doesn't take care 
of that, but assumes that the past is a good indicator of use for the 
future.
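
For illustration, a postgresql.conf fragment using these knobs might look 
like the following (the values are made up, not recommendations):

  max_background_workers = 8          # overall limit, replaces autovacuum_max_workers
  min_spare_background_workers = 1    # idle workers to keep pre-forked, per database
  max_spare_background_workers = 4    # idle workers beyond this get stopped, per database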

Hope that sheds some more light on how bgworkers could be useful. Maybe 
I just need to describe the job handling features of the coordinator 
better as well? (Simon also requested better documentation...)

Regards

Markus Wanner


Re: bg worker: general purpose requirements

From
Robert Haas
Date:
On Fri, Sep 17, 2010 at 11:29 AM, Markus Wanner <markus@bluegap.ch> wrote:
> autonomous transactions: max. one per normal backend (correct?), way fewer
> should suffice in most cases, only control data to be passed around

Technically, you could start an autonomous transaction from within an
autonomous transaction, so I don't think there's a hard maximum of one
per normal backend.  However, I agree that the expected case is to not
have very many.

> All of the potential users of bgworkers benefit from a pre-connected
> bgworker. Meaning having at least one spare bgworker around per database
> could be beneficial, potentially more depending on how often spike loads
> occur. As long as there are only few databases, it's easily possible to have
> at least one spare process around per database, but with thousands of
> databases, that might get prohibitively expensive (not sure where the
> boundary between win vs. lose is, though. Idle backends vs. connection
> cost).

I guess it depends on what your goals are.  If you're optimizing for
ability to respond quickly to a sudden load, keeping idle backends
will probably win even when the number of them you're keeping around
is fairly high.  If you're optimizing for minimal overall resource
consumption, though, you'll not be as happy about that.  What I'm
struggling to understand is this: if there aren't any preforked
workers around when the load hits, how much does it slow things down?
I would have thought that a few seconds to ramp up to speed after an
extended idle period (5 minutes, say) would be acceptable for most of
the applications you mention.  Is the ramp-up time longer than that,
or is even that much delay unacceptable for Postgres-R, or is there
some other aspect to the problem I'm failing to grasp?  I can tell you
have some experience tuning this so I'd like to try to understand
where you're coming from.

> However, I feel like this gives less control over how the bgworkers are
> used. For example, I'd prefer to be able to prevent the system from
> allocating all bgworkers to a single database at once.

I think this is an interesting example, and worth some further
thought.  I guess I don't really understand how Postgres-R uses these
bgworkers.  Are you replicating one transaction at a time, or how does
the data get sliced up?  I remember you mentioning
sync/async/eager/other replication strategies previously - do you have
a pointer to some good reading on that topic?

> Hope that sheds some more light on how bgworkers could be useful. Maybe I
> just need to describe the job handling features of the coordinator better as
> well? (Simon also requested better documentation...)

That seems like it would be useful, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: bg worker: general purpose requirements

From
Markus Wanner
Date:
Robert,

On 09/17/2010 05:52 PM, Robert Haas wrote:
> Technically, you could start an autonomous transaction from within an
> autonomous transaction, so I don't think there's a hard maximum of one
> per normal backend.  However, I agree that the expected case is to not
> have very many.

Thanks for pointing that out. I somehow knew that was wrong..

> I guess it depends on what your goals are.

Agreed.

> If you're optimizing for
> ability to respond quickly to a sudden load, keeping idle backends
> will probably win even when the number of them you're keeping around
> is fairly high.  If you're optimizing for minimal overall resource
> consumption, though, you'll not be as happy about that.

What resources are we talking about here? Are idle backends really that 
resource hungry? My feeling so far has been that idle processes are 
relatively cheap (i.e. some 100 idle processes shouldn't hurt on a 
modern server).

> What I'm
> struggling to understand is this: if there aren't any preforked
> workers around when the load hits, how much does it slow things down?

As the startup code is pretty much the same as for the current 
avlauncher, the coordinator can only request one bgworker at a time.

This means the signal needs to reach the postmaster, which then forks a 
bgworker process. That new process starts up, connects to the requested 
database and then sends an imessage to the coordinator to register. Only 
after having received that registration, the coordinator can request 
another bgworker (note that this is a one-overall limitation, not per 
database).

I haven't measured the actual time it takes, but given the use case of a 
connection pool, I so far thought it's obvious that this process takes 
too long.

(It's exactly what apache pre-fork does, no? Is anybody concerned about 
the idle processes there? Or do they consume much less resources?)

> I would have thought that a few seconds to ramp up to speed after an
> extended idle period (5 minutes, say) would be acceptable for most of
> the applications you mention.

A few seconds? That might be sufficient for autovacuum, but most queries 
are completed in less than one second. So for parallel querying, 
autonomous transactions and Postgres-R, I certainly don't think that a 
few seconds are reasonable. Especially considering the cost of idle 
backends.

> Is the ramp-up time longer than that,
> or is even that much delay unacceptable for Postgres-R, or is there
> some other aspect to the problem I'm failing to grasp?  I can tell you
> have some experience tuning this so I'd like to try to understand
> where you're coming from.

I didn't ever compare to a max_spare_background_workers = 0 
configuration, so I don't have any hard numbers, sorry.

> I think this is an interesting example, and worth some further
> thought.  I guess I don't really understand how Postgres-R uses these
> bgworkers.

The given example doesn't only apply to Postgres-R. But with fewer 
bgworkers in total, you are more likely to want to use them all for one 
database, yes.

> Are you replicating one transaction at a time, or how does
> the data get sliced up?

Yes, one transaction at a time. One transaction per backend (bgworker). 
On a cluster with n nodes that only performs writing transactions, 
averaging m concurrent transactions per node, you ideally end up 
having m normal backends and (n-1) * m bgworkers that concurrently apply 
the remote transactions.
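
(As a concrete example of that formula: with n = 4 nodes and m = 10 
concurrent writing transactions per node, each node would run 10 normal 
backends plus (4 - 1) * 10 = 30 bgworkers applying remote transactions.)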

> I remember you mentioning
> sync/async/eager/other replication strategies previously - do you have
> a pointer to some good reading on that topic?

Postgres-R mainly is eager multi-master replication. www.postgres-r.org 
has some links, most up-to-date my concept paper:
http://www.postgres-r.org/downloads/concept.pdf

> That seems like it would be useful, too.

Okay, will try to come up with something, soon(ish).

Thank you for your feedback and constructive criticism.

Regards

Markus Wanner


Re: bg worker: general purpose requirements

From
Robert Haas
Date:
On Fri, Sep 17, 2010 at 4:49 PM, Markus Wanner <markus@bluegap.ch> wrote:
>> If you're optimizing for
>> ability to respond quickly to a sudden load, keeping idle backends
>> will probably win even when the number of them you're keeping around
>> is fairly high.  If you're optimizing for minimal overall resource
>> consumption, though, you'll not be as happy about that.
>
> What resources are we talking about here? Are idle backends really that
> resource hungry? My feeling so far has been that idle processes are
> relatively cheap (i.e. some 100 idle processes shouldn't hurt on a modern
> server).

Wow, 100 processes??! Really?  I guess I don't actually know how large
modern proctables are, but on my MacOS X machine, for example, there
are only 75 processes showing up right now in "ps auxww".  My Fedora
12 machine has 97.  That's including a PostgreSQL instance in the
first case and an Apache instance in the second case.  So 100 workers
seems like a ton to me.

>> What I'm
>> struggling to understand is this: if there aren't any preforked
>> workers around when the load hits, how much does it slow things down?
>
> As the startup code is pretty much the same as for the current avlauncher,
> the coordinator can only request one bgworker at a time.
>
> This means the signal needs to reach the postmaster, which then forks a
> bgworker process. That new process starts up, connects to the requested
> database and then sends an imessage to the coordinator to register. Only
> after having received that registration, the coordinator can request another
> bgworker (note that this is a one-overall limitation, not per database).
>
> I haven't measured the actual time it takes, but given the use case of a
> connection pool, I so far thought it's obvious that this process takes too
> long.

Maybe that would be a worthwhile exercise...

> (It's exactly what apache pre-fork does, no? Is anybody concerned about the
> idle processes there? Or do they consume much less resources?)

I think the kicker here is the idea of having a certain number of
extra workers per database.  On my vanilla Apache server on the
above-mentioned Fedora 12 VM, there are a total of 10 processes
running.  I am sure that could balloon to 100 or more under load, but
it's not keeping around 100 processes on an otherwise idle system.  So
if you knew you only had 1 database, keeping around 2 or 3 or 5 or
even 10 workers might seem reasonable, but since you might have 1
database or 1000 databases, it doesn't.  Keeping 2 or 3 or 5 or 10
workers TOTAL around could be reasonable, but not per-database.  As
Tom said upthread, we don't want to assume that we're the only thing
running on the box and are therefore entitled to take up all the
available memory/disk/process slots/whatever.  And even if we DID feel
so entitled, there could be hundreds of databases, and it certainly
doesn't seem practical to keep 1000 workers around "just in case".

I don't know whether an idle Apache worker consumes more or less
memory than an idle PostgreSQL worker, but another difference between
the Apache case and the PostgreSQL case is that presumably all those
backend processes have attached shared memory and have ProcArray
slots.  We know that code doesn't scale terribly well, especially in
terms of taking snapshots, and that's one reason why high-volume
PostgreSQL installations pretty much require a connection pooler.  I
think the sizes of the connection pools I've seen recommended are
considerably smaller than 100, more like 2 * CPUs + spindles, or
something like that.  It seems like if you actually used all 100
workers at the same time performance might be pretty awful.

I was taking a look at the Mammoth Replicator code this week
(parenthetical note: I couldn't figure out where mcp_server was or how
to set it up) and it apparently has a limitation that only one
database in the cluster can be replicated.  I'm a little fuzzy on how
Mammoth works, but apparently this problem of scaling to large numbers
of databases is not unique to Postgres-R.

>> Is the ramp-up time longer than that,
>> or is even that much delay unacceptable for Postgres-R, or is there
>> some other aspect to the problem I'm failing to grasp?  I can tell you
>> have some experience tuning this so I'd like to try to understand
>> where you're coming from.
>
> I didn't ever compare to a max_spare_background_workers = 0 configuration,
> so I don't have any hard numbers, sorry.

Hmm, OK.

>> I think this is an interesting example, and worth some further
>> thought.  I guess I don't really understand how Postgres-R uses these
>> bgworkers.
>
> The given example doesn't only apply to Postgres-R. But with fewer bgworkers
> in total, you are more likely to want to use them all for one database, yes.
>
>> Are you replicating one transaction at a time, or how does
>> the data get sliced up?
>
> Yes, one transaction at a time. One transaction per backend (bgworker). On a
> cluster with n nodes that only performs writing transactions, averaging m
> concurrent transactions per node, you ideally end up having m normal
> backends and (n-1) * m bgworkers that concurrently apply the remote
> transactions.

What is the granularity of replication?  Per-database?  Per-table?
How do you accumulate the change sets?  Some kind of bespoke hook, WAL
scanning, ...?

> Thank you for your feedback and constructive criticism.

My pleasure.  Interesting stuff.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: bg worker: general purpose requirements

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Wow, 100 processes??! Really?  I guess I don't actually know how large
> modern proctables are, but on my MacOS X machine, for example, there
> are only 75 processes showing up right now in "ps auxww".  My Fedora
> 12 machine has 97.  That's including a PostgreSQL instance in the
> first case and an Apache instance in the second case.  So 100 workers
> seems like a ton to me.

The part of that that would worry me is open files.  PG backends don't
have any compunction about holding open hundreds of files.  Apiece.
You can dial that down but it'll cost you performance-wise.  Last
I checked, most Unix kernels still had limited-size FD arrays.

And as you say, ProcArray manipulations aren't going to be terribly
happy about large numbers of idle backends, either.
        regards, tom lane


Re: bg worker: general purpose requirements

From
tomas@tuxteam.de
Date:
On Fri, Sep 17, 2010 at 11:21:13PM -0400, Robert Haas wrote:

[...]

> Wow, 100 processes??! Really?  I guess I don't actually know how large
> modern proctables are, but on my MacOS X machine, for example, there
> are only 75 processes showing up right now in "ps auxww".  My Fedora
> 12 machine has 97.  That's including a PostgreSQL instance in the
> first case and an Apache instance in the second case.  So 100 workers
> seems like a ton to me.

As an equally unscientific data point, on my box, a typical desktop box
(actually a netbook, slow CPU, but beefed up to 2GB RAM), I have 5
PostgreSQL processes running, which take away about 1.2 MB (resident) --
not each one, but together!. As a contrast, there is *one* mysql daemon
(don't ask!), taking away 17 MB. The worst offenders are, by far, the
eye-candy thingies, as one has become accustomed to expect :-(

What I wanted to say is that the PostgreSQL processes are unusually
light-weight by modern standards.

Regards
-- tomás


Re: bg worker: general purpose requirements

From
Markus Wanner
Date:
On 09/18/2010 05:43 AM, Tom Lane wrote:
> The part of that that would worry me is open files.  PG backends don't
> have any compunction about holding open hundreds of files.  Apiece.
> You can dial that down but it'll cost you performance-wise.  Last
> I checked, most Unix kernels still had limited-size FD arrays.

Thank you very much, that's a helpful hint.

I did some quick testing and managed to fork up to around 2000 backends,
at which point my (laptop) system got unresponsive. To be honest, that
really surprises me.

(I had to increase the SHM and SEM kernel limits to be able to start
Postgres with that many processes at all. Obviously, Linux doesn't seem
to like that... on a second test I got a kernel panic)

> And as you say, ProcArray manipulations aren't going to be terribly
> happy about large numbers of idle backends, either.

Very understandable, yes.

Regards

Markus Wanner


Re: bg worker: general purpose requirements

From
Markus Wanner
Date:
Hi,

On 09/18/2010 05:21 AM, Robert Haas wrote:
> Wow, 100 processes??! Really?  I guess I don't actually know how large
> modern proctables are, but on my MacOS X machine, for example, there
> are only 75 processes showing up right now in "ps auxww".  My Fedora
> 12 machine has 97.  That's including a PostgreSQL instance in the
> first case and an Apache instance in the second case.  So 100 workers
> seems like a ton to me.

Well, Apache pre-forks 5 processes in total (by default, that is, for
high volume webservers a higher MinSpareServers setting is certainly not
out of the question), while bgworkers currently needs to fork
min_spare_background_workers processes per database.

AIUI, that's the main problem with the current architecture.

>> I haven't measured the actual time it takes, but given the use case of a
>> connection pool, I so far thought it's obvious that this process takes too
>> long.
> 
> Maybe that would be a worthwhile exercise...

On my laptop I'm measuring around 18 bgworker starts per second, i.e.
roughly 50 ms per bgworker start. That's certainly just a ball-park figure..

One could parallelize the communication channel between the coordinator
and postmaster, so as to be able to start multiple bgworkers in
parallel, but the initial latency remains.

It's certainly quick enough for autovacuum. But equally certainly not
acceptable for Postgres-R, where latency is the worst enemy in the first
place.

For autonomous transactions and parallel querying, I'd also say that I'd
rather not like to have such a latency.

> I think the kicker here is the idea of having a certain number of
> extra workers per database.

Agreed, but I don't see any better way. Short of a re-connecting feature.

> So
> if you knew you only had 1 database, keeping around 2 or 3 or 5 or
> even 10 workers might seem reasonable, but since you might have 1
> database or 1000 databases, it doesn't.  Keeping 2 or 3 or 5 or 10
> workers TOTAL around could be reasonable, but not per-database.  As
> Tom said upthread, we don't want to assume that we're the only thing
> running on the box and are therefore entitled to take up all the
> available memory/disk/process slots/whatever.  And even if we DID feel
> so entitled, there could be hundreds of databases, and it certainly
> doesn't seem practical to keep 1000 workers around "just in case".

Agreed. Looks like Postgres-R has a slightly different focus, because if
you need multi-master replication, you probably don't have 1000s of
databases and/or lots of other services on the same machine.

> I don't know whether an idle Apache worker consumes more or less
> memory than an idle PostgreSQL worker, but another difference between
> the Apache case and the PostgreSQL case is that presumably all those
> backend processes have attached shared memory and have ProcArray
> slots.  We know that code doesn't scale terribly well, especially in
> terms of taking snapshots, and that's one reason why high-volume
> PostgreSQL installations pretty much require a connection pooler.  I
> think the sizes of the connection pools I've seen recommended are
> considerably smaller than 100, more like 2 * CPUs + spindles, or
> something like that.  It seems like if you actually used all 100
> workers at the same time performance might be pretty awful.

Sounds reasonable, yes.

> I was taking a look at the Mammoth Replicator code this week
> (parenthetical note: I couldn't figure out where mcp_server was or how
> to set it up) and it apparently has a limitation that only one
> database in the cluster can be replicated.  I'm a little fuzzy on how
> Mammoth works, but apparently this problem of scaling to large numbers
> of databases is not unique to Postgres-R.

Postgres-R is able to replicate multiple databases. Maybe not thousands,
but still designed for it.

> What is the granularity of replication?  Per-database?  Per-table?

Currently per-cluster (i.e. all your databases at once).

> How do you accumulate the change sets?

Logical changes get collected at the heapam level. They get serialized
and streamed (via imessages and a group communication system) to all
nodes. Application of change sets is highly parallelized and should be
pretty efficient. Commit ordering is decided by the GCS to guarantee
consistency across all nodes, conflicts get resolved by aborting the
later transaction.

> Some kind of bespoke hook, WAL scanning, ...?

No hooks, please!  ;-)

Regards

Markus Wanner


Re: bg worker: general purpose requirements

From
Robert Haas
Date:
On Mon, Sep 20, 2010 at 11:30 AM, Markus Wanner <markus@bluegap.ch> wrote:
> Well, Apache pre-forks 5 processes in total (by default, that is, for
> high volume webservers a higher MinSpareServers setting is certainly not
> out of question). While bgworkers currently needs to fork
> min_spare_background_workers processes per database.
>
> AIUI, that's the main problem with the current architecture.

Assuming that "the main problem" refers more or less to the words "per
database", I agree.

>>> I haven't measured the actual time it takes, but given the use case of a
>>> connection pool, I so far thought it's obvious that this process takes too
>>> long.
>>
>> Maybe that would be a worthwhile exercise...
>
> On my laptop I'm measuring around 18 bgworker starts per second, i.e.
> roughly 50 ms per bgworker start. That's certainly just a ball-park figure..

Gee, that doesn't seem slow enough to worry about to me.  If we
suppose that you need 2 * CPUs + spindles processes to fully load the
system, that means you should be able to ramp up from zero to
consuming every available system resource in under a second; except
perhaps on a system with a huge RAID array, which might need 2 or 3
seconds.  If you parallelize the worker startup, as you suggest, I'd
think you could knock quite a bit more off of this, but why all the
worry about startup latency?  Once the system is chugging along, none
of this should matter very much, I would think.  If you need to
repeatedly kill off some workers bound to one database and start some
new ones to bind to a different database, that could be sorta painful,
but if you can actually afford to keep around the workers for all the
databases you care about, it seems fine.
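
(To put numbers on that: an 8-core box with 4 spindles would want about
2 * 8 + 4 = 20 workers; at the roughly 50 ms per serial start you measured,
that's about one second of ramp-up.)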

>> How do you accumulate the change sets?
>
> Logical changes get collected at the heapam level. They get serialized
> and streamed (via imessages and a group communication system) to all
> nodes. Application of change sets is highly parallelized and should be
> pretty efficient. Commit ordering is decided by the GCS to guarantee
> consistency across all nodes, conflicts get resolved by aborting the
> later transaction.

Neat stuff.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: bg worker: general purpose requirements

From
Markus Wanner
Date:
Robert,

On 09/20/2010 06:57 PM, Robert Haas wrote:
> Gee, that doesn't seem slow enough to worry about to me.  If we
> suppose that you need 2 * CPUs + spindles processes to fully load the
> system, that means you should be able to ramp up from zero to
> consuming every available system resource in under a second; except
> perhaps on a system with a huge RAID array, which might need 2 or 3
> seconds.  If you parallelize the worker startup, as you suggest, I'd
> think you could knock quite a bit more off of this, but why all the
> worry about startup latency?  Once the system is chugging along, none
> of this should matter very much, I would think.  If you need to
> repeatedly kill off some workers bound to one database and start some
> new ones to bind to a different database, that could be sorta painful,
> but if you can actually afford to keep around the workers for all the
> databases you care about, it seems fine.

Hm.. I see. So in other words, you are saying
min_spare_background_workers isn't flexible enough in case one has
thousands of databases but only uses a few of them frequently.

I understand that reasoning and the wish to keep the number of GUCs as
low as possible. I'll try to drop the min_spare_background_workers from
the bgworker patches.

The rest of the bgworker infrastructure should behave pretty much like
what you have described. Parallelism in starting bgworkers could be a
nice improvement, especially if we kill the min_spare_background_workers
mechanism.

> Neat stuff.

Thanks.

Markus Wanner


Re: bg worker: general purpose requirements

From
Robert Haas
Date:
On Mon, Sep 20, 2010 at 1:45 PM, Markus Wanner <markus@bluegap.ch> wrote:
> Hm.. I see. So in other words, you are saying
> min_spare_background_workers isn't flexible enough in case one has
> thousands of databases but only uses a few of them frequently.

Yes, I think that is true.

> I understand that reasoning and the wish to keep the number of GUCs as
> low as possible. I'll try to drop the min_spare_background_workers from
> the bgworker patches.

OK.  At least for me, what is important is not only how many GUCs
there are but how likely they are to require tuning and how easy it
will be to know what the appropriate value is.  It seems fairly easy
to tune the maximum number of background workers, and it doesn't seem
hard to tune an idle timeout, either.  Both of those are pretty
straightforward trade-offs between, on the one hand, consuming more
system resources, and on the other hand, better throughput and/or
latency.  On the other hand, the minimum number of workers to keep
around per-database seems hard to tune.  If performance is bad, do I
raise it or lower it?  And it's certainly not really a hard minimum
because it necessarily bumps up against the limit on overall number of
workers if the number of databases grows too large; one or the other
has to give.

I think we need to look for a way to eliminate the maximum number of
workers per database, too.  Your previous point about not wanting one
database to gobble up all the available slots makes sense, but again,
it's not obvious how to set this sensibly.  If 99% of your activity is
in one database, you might want to use all the slots for that
database, at least until there's something to do in some other
database.  I feel like the right thing here is for the number of
workers for any given database to fluctuate in some natural way that
is based on the workload.  If one database has all the activity, it
gets all the slots, at least until somebody else needs them.  Of
course, you need to design the algorithm so as to avoid starvation...

> The rest of the bgworker infrastructure should behave pretty much like
> what you have described. Parallelism in starting bgworkers could be a
> nice improvement, especially if we kill the min_spare_background_workers
> mechanism.

Works for me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: bg worker: general purpose requirements

From
Markus Wanner
Date:
On 09/21/2010 02:49 AM, Robert Haas wrote:
> OK.  At least for me, what is important is not only how many GUCs
> there are but how likely they are to require tuning and how easy it
> will be to know what the appropriate value is.  It seems fairly easy
> to tune the maximum number of background workers, and it doesn't seem
> hard to tune an idle timeout, either.  Both of those are pretty
> straightforward trade-offs between, on the one hand, consuming more
> system resources, and on the other hand, better throughput and/or
> latency.

Hm.. I thought of it the other way around. It's more obvious and direct
for me to determine a min and max of the amount of parallel jobs I want
to perform at once. Based on the number of spindles, CPUs and/or nodes
in the cluster (in case of Postgres-R). Admittedly, not necessarily per
database, but at least overall.

I wouldn't know what to set a timeout to. And you didn't make a good
argument for any specific value so far. Nor did you offer a reasoning
for how to find one. It's certainly very workload and feature specific.

> On the other hand, the minimum number of workers to keep
> around per-database seems hard to tune.  If performance is bad, do I
> raise it or lower it?

Same applies for the timeout value.

> And it's certainly not really a hard minimum
> because it necessarily bumps up against the limit on overall number of
> workers if the number of databases grows too large; one or the other
> has to give.

I'd consider the case of min_spare_background_workers * number of
databases > max_background_workers to be a configuration error, about
which the coordinator should warn.
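
Roughly, something like this inside the coordinator (a sketch only; the
variable names are illustrative, not taken from the patch):

  if (min_spare_background_workers * n_databases > max_background_workers)
      ereport(WARNING,
              (errmsg("min_spare_background_workers cannot be satisfied "
                      "for all %d databases", n_databases)));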

> I think we need to look for a way to eliminate the maximum number of
> workers per database, too.

Okay, might make sense, yes.

Dropping both of these per-database GUCs, we'd simply end up with having
max_background_workers around all the time.

A timeout would mainly help to limit the max amount of time workers sit
around idle. I fail to see how that's more helpful than the proposed
min/max. Quite the opposite, it's impossible to get any useful guarantees.

It assumes that the workload remains the same over time, but doesn't
cope well with sudden spikes and changes in the workload. Unlike the
proposed min/max combination, which forks new bgworkers in advance, even
if the database already uses lots of them. And after the spike, it
quickly reduces the amount of spare bgworkers to a certain max. While
not perfect, it's definitely more adaptive to the workload (at least in
the usual case of having only few databases).

Maybe we need a more sophisticated algorithm in the coordinator. For
example measuring the avg. amount of concurrent jobs per database over
time and adjusting the number of idle backends according to that, the 
current workload and the max_background_workers, or some such. The
min/max GUCs were simply easier to implement, but I'm open to a more
sophisticated thing.
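
As a standalone sketch of what I mean (purely illustrative, nothing like
this is in the current patch):

  /* Exponentially smoothed estimate of concurrent jobs for one database,
   * updated each time the coordinator samples its job queue. */
  static double
  update_job_average(double avg, int current_jobs)
  {
      const double alpha = 0.2;             /* smoothing factor */

      return alpha * current_jobs + (1.0 - alpha) * avg;
  }

  /* Derive how many idle workers to keep pre-forked for that database,
   * bounded by the overall worker limit. */
  static int
  target_idle_workers(double avg_jobs, int max_background_workers)
  {
      int target = (int) (avg_jobs + 0.5);  /* round to nearest */

      if (target > max_background_workers)
          target = max_background_workers;
      return target;
  }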

Regards

Markus Wanner


Re: bg worker: general purpose requirements

From
Robert Haas
Date:
On Tue, Sep 21, 2010 at 4:23 AM, Markus Wanner <markus@bluegap.ch> wrote:
> On 09/21/2010 02:49 AM, Robert Haas wrote:
>> OK.  At least for me, what is important is not only how many GUCs
>> there are but how likely they are to require tuning and how easy it
>> will be to know what the appropriate value is.  It seems fairly easy
>> to tune the maximum number of background workers, and it doesn't seem
>> hard to tune an idle timeout, either.  Both of those are pretty
>> straightforward trade-offs between, on the one hand, consuming more
>> system resources, and on the other hand, better throughput and/or
>> latency.
>
> Hm.. I thought of it the other way around. It's more obvious and direct
> for me to determine a min and max of the amount of parallel jobs I want
> to perform at once. Based on the number of spindles, CPUs and/or nodes
> in the cluster (in case of Postgres-R). Admittedly, not necessarily per
> database, but at least overall.

Wait, are we in violent agreement here?  An overall limit on the
number of parallel jobs is exactly what I think *does* make sense.
It's the other knobs I find odd.

> I wouldn't known what to set a timeout to. And you didn't make a good
> argument for any specific value so far. Nor did you offer a reasoning
> for how to find one. It's certainly very workload and feature specific.

I think my basic contention is that it doesn't matter very much, so
any reasonable value should be fine.  I think 5 minutes will be good
enough for 99% of cases.  But if you find that this leaves too many
extra backends around and you start to run out of file descriptors or
your ProcArray gets too full, then you might want to drop it down.
Conversely, if you want to fine-tune your system for sudden load
spikes, you could raise it.

> I'd consider the case of min_spare_background_workers * number of
> databases > max_background_workers to be a configuration error, about
> which the coordinator should warn.

The number of databases isn't a configuration parameter.  Ideally,
users shouldn't have to reconfigure the system because they create
more databases.

>> I think we need to look for a way to eliminate the maximum number of
>> workers per database, too.
>
> Okay, might make sense, yes.
>
> Dropping both of these per-database GUCs, we'd simply end up with having
> max_background_workers around all the time.
>
> A timeout would mainly help to limit the max amount of time workers sit
> around idle. I fail to see how that's more helpful than the proposed
> min/max. Quite the opposite, it's impossible to get any useful guarantees.
>
> It assumes that the workload remains the same over time, but doesn't
> cope well with sudden spikes and changes in the workload.

I guess we differ on the meaning of "cope well"...  being able to spin
up 18 workers in one second seems very fast to me.  How many do you
expect to ever need?!!

> Unlike the
> proposed min/max combination, which forks new bgworkers in advance, even
> if the database already uses lots of them. And after the spike, it
> quickly reduces the amount of spare bgworkers to a certain max. While
> not perfect, it's definitely more adaptive to the workload (at least in
> the usual case of having only few databases).
>
> Maybe we need a more sophisticated algorithm in the coordinator. For
> example measuring the avg. amount of concurrent jobs per database over
> time and adjust the number of idle backends according to that, the
> current workload and the max_background_workers, or some such. The
> min/max GUCs were simply easier to implement, but I'm open to a more
> sophisticated thing.

Possibly, but I'm still having a hard time understanding why you need
all the complexity you already have.  The way I'd imagine doing this
is:

1. If a new job arrives, and there is an idle worker available for the
correct database, then allocate that worker to that job.  Stop.
2. Otherwise, if the number of background workers is less than the
maximum number allowable, then start a new worker for the appropriate
database and allocate it to the new job.  Stop.
3. Otherwise, if there is at least one idle background worker, kill it
and start a new one for the correct database.  Allocate that new
worker to the new job.  Stop.
4. Otherwise, you're already at the maximum number of background
workers and they're all busy.  Wait until some worker finishes a job,
and then try again beginning with step 1.

When a worker finishes a job, it hangs around for a few minutes to see
if it gets assigned a new job (as per #1) and then exits.
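
For concreteness, a minimal standalone sketch of that policy over a plain
array of worker slots (the struct and helper names are invented here for
illustration; they are not part of the patch):

  #include <stdbool.h>

  typedef struct WorkerSlot
  {
      bool    in_use;       /* slot holds a live worker process */
      bool    busy;         /* worker currently executing a job */
      int     dboid;        /* database the worker is bound to */
      long    idle_since;   /* timestamp of last job completion */
  } WorkerSlot;

  /*
   * Assign a job for database 'dboid' to a worker, following steps 1-4
   * above.  Returns the chosen slot index, or -1 if all workers are busy
   * and the caller must wait for one to finish and retry.
   */
  static int
  dispatch_job(WorkerSlot *slots, int nslots, int dboid)
  {
      int     free_slot = -1;
      int     oldest_idle = -1;

      for (int i = 0; i < nslots; i++)
      {
          if (!slots[i].in_use)
          {
              if (free_slot < 0)
                  free_slot = i;
              continue;
          }
          if (slots[i].busy)
              continue;

          /* 1. an idle worker already bound to the right database */
          if (slots[i].dboid == dboid)
          {
              slots[i].busy = true;
              return i;
          }

          /* remember the longest-idle worker for step 3 */
          if (oldest_idle < 0 ||
              slots[i].idle_since < slots[oldest_idle].idle_since)
              oldest_idle = i;
      }

      /* 2. below the overall maximum: start a fresh worker */
      if (free_slot >= 0)
      {
          slots[free_slot].in_use = true;
          slots[free_slot].busy = true;
          slots[free_slot].dboid = dboid;    /* fork + connect would go here */
          return free_slot;
      }

      /* 3. at the maximum: recycle the longest-idle worker (kill it and
       *    start a new one for the new database, or rebind it) */
      if (oldest_idle >= 0)
      {
          slots[oldest_idle].dboid = dboid;
          slots[oldest_idle].busy = true;
          return oldest_idle;
      }

      /* 4. everything is busy: caller waits and retries */
      return -1;
  }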

Although there are other tunables that can be exposed, I would expect,
in this design, that the only thing most people would need to adjust
would be the maximum pool size.

It seems (to me) like your design is being driven by start-up latency,
which I just don't understand.  Sure, 50 ms to start up a worker isn't
fantastic, but the idea is that it won't happen much because there
will probably already be a worker in that database from previous
activity.  The only exception is when there's a sudden surge of
activity.  But I don't think that's the case to optimize for.  If a
database hasn't had any activity in a while, I think it's better to
reclaim the memory and file descriptors and ProcArray slots that we're
spending on it so that the rest of the system can run faster.  If that
means it takes an extra fraction of a second to respond at some later
point, I can live with that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: bg worker: general purpose requirements

From
Markus Wanner
Date:
On 09/21/2010 03:46 PM, Robert Haas wrote:
> Wait, are we in violent agreement here?  An overall limit on the
> number of parallel jobs is exactly what I think *does* make sense.
> It's the other knobs I find odd.

Note that the max setting I've been talking about here is the maximum
amount of *idle* workers allowed. It does not include busy bgworkers.

> I guess we differ on the meaning of "cope well"...  being able to spin
> up 18 workers in one second seems very fast to me.  

Well, it's obviously use case dependent. For Postgres-R (and sync
replication) in general, people are very sensitive to latency. There's
the network latency already, but adding a 50ms latency for no good
reason is not going to make these people happy.

> How many do you expect to ever need?!!

Again, very different. For Postgres-R, easily a couple of dozens. Same
applies for parallel querying when having multiple concurrent parallel
queries.

> Possibly, but I'm still having a hard time understanding why you need
> all the complexity you already have.

To make sure we only pay the startup cost on very rare occasions, and
not every time the workload changes a bit (or doesn't conform to
an arbitrary timeout).

(BTW the min/max is hardly any more complex than a timeout. It doesn't
even need a syscall).

> It seems (to me) like your design is being driven by start-up latency,
> which I just don't understand.  Sure, 50 ms to start up a worker isn't
> fantastic, but the idea is that it won't happen much because there
> will probably already be a worker in that database from previous
> activity.  The only exception is when there's a sudden surge of
> activity.

I'm less optimistic about the consistency of the workload.

> But I don't think that's the case to optimize for.  If a
> database hasn't had any activity in a while, I think it's better to
> reclaim the memory and file descriptors and ProcArray slots that we're
> spending on it so that the rest of the system can run faster.

Absolutely. It's what I call a change in workload. The min/max approach
is certainly faster at reclaiming unused workers, but (depending on the
max setting) doesn't necessarily ever go down to zero.

Regards

Markus Wanner


Re: bg worker: general purpose requirements

From
Robert Haas
Date:
On Tue, Sep 21, 2010 at 11:31 AM, Markus Wanner <markus@bluegap.ch> wrote:
> On 09/21/2010 03:46 PM, Robert Haas wrote:
>> Wait, are we in violent agreement here?  An overall limit on the
>> number of parallel jobs is exactly what I think *does* make sense.
>> It's the other knobs I find odd.
>
> Note that the max setting I've been talking about here is the maximum
> amount of *idle* workers allowed. It does not include busy bgworkers.

Oh, wow.  Is there another limit on the total number of bgworkers?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: bg worker: general purpose requirements

From
Markus Wanner
Date:
On 09/21/2010 05:59 PM, Robert Haas wrote:
> Oh, wow.  Is there another limit on the total number of bgworkers?

There currently are three GUCs that control bgworkers:

max_background_workers
min_spare_background_workers
max_spare_background_workers

The first replaces the former autovacuum_max_workers GUC. As before, it
is an overall limit, much like max_connections.

The latter two are additional. They are per-database lower and upper
limits for the number of idle workers at any point in time. These latter
two are what I'm referring to as the min/max approach, and what I'm
arguing cannot be replaced by a timeout without losing functionality.

Regards

Markus Wanner


Re: bg worker: general purpose requirements

From
Greg Stark
Date:
On Sat, Sep 18, 2010 at 4:21 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> (It's exactly what apache pre-fork does, no? Is anybody concerned about the
>> idle processes there? Or do they consume much less resources?)
....
>
> I don't know whether an idle Apache worker consumes more or less
> memory than an idle PostgreSQL worker, but another difference between
> the Apache case and the PostgreSQL case is that presumably all those

Apache, like Postgres, handles a lot of different use cases and the
ideal configuration depends heavily on how you use it. The default
configs for Apache are meant to be flexible and handle a mixed
workload where some requests are heavyweight scripts which might have
waits on a database and others are lightweight requests for static
objects. This is a hard configuration to get right but the default is
to ramp up the number of processes dynamically in the hopes of
reaching some kind of equilibrium.

Generally the recommended approach for a high traffic site is to use a
dedicated Apache or thttpd or equivalent install for the static
objects -- this one would have hundreds of workers or threads or
whatever and each one would be fairly lightweight. In fact nearly all
the RAM can be shared and the overhead of forking a new process would
be too high compared to serving static content from cache to let the
number scale dynamically. If you have 200 processes each of which has
only a few kB of private RAM then nearly all the RAM is available for
filesystem cache and requests can be served in milliseconds (mostly
network latency).

Then the heavyweight scripts can run on a dedicated Apache install
where the total number of processes is limited to something sane like
a small multiple of the number of cores -- basically RAM / ram
required to run the interpreter. If you have 20 processes and each
uses a 40 MB then your 2GB machine has about half its RAM available
for filesystem cache or other uses. Again you want to run with the 20
processes always running -- this time because the interpreter startup
is usually quite slow.

The dynamic ramp-up is a feature to deal with the default install and
with use cases where the system has lots of different users with
different needs.


-- 
greg


Re: bg worker: general purpose requirements

From
Markus Wanner
Date:
Greg,

On 09/25/2010 08:03 PM, Greg Stark wrote:
> The dynamic ramp-up is a feature to deal with the default install and
> with use cases where the system has lots of different users with
> different needs.

Thanks for sharing this. From that perspective, neither the current
min/max nor the timeout configuration approach would be satisfying, as
it's not really possible to configure it to always have a certain amount
of bgworkers (exactly adjusted to the requirements at hand, with
possibly differing requirements per database).

I'm unsure about how to continue here. It seems we need the ability to
switch between databases not (only) for performance, but for ease of
configuration.

Regards

Markus Wanner