Thread: profiling connection overhead
Per previous threats, I spent some time tonight running oprofile (using
the directions Tom Lane was foolish enough to provide me back in May).
I took testlibpq.c and hacked it up to just connect to the server and
then disconnect in a tight loop without doing anything useful, hoping
to measure the overhead of starting up a new connection.  Ha, ha, funny
about that:

120899   18.0616  postgres        AtProcExit_Buffers
 56891    8.4992  libc-2.11.2.so  memset
 30987    4.6293  libc-2.11.2.so  memcpy
 26944    4.0253  postgres        hash_search_with_hash_value
 26554    3.9670  postgres        AllocSetAlloc
 20407    3.0487  libc-2.11.2.so  _int_malloc
 17269    2.5799  libc-2.11.2.so  fread
 13005    1.9429  ld-2.11.2.so    do_lookup_x
 11850    1.7703  ld-2.11.2.so    _dl_fixup
 10194    1.5229  libc-2.11.2.so  _IO_file_xsgetn

In English: the #1 overhead here is actually something that happens
when processes EXIT, not when they start.  Essentially all the time is
in two lines:

 56920    6.6006 :        for (i = 0; i < NBuffers; i++)
                 :        {
 98745   11.4507 :            if (PrivateRefCount[i] != 0)

Non-default configs:

max_connections = 100
shared_buffers = 480MB
work_mem = 4MB
maintenance_work_mem = 128MB
checkpoint_segments = 30
random_page_cost = 2.0

Anything we can do about this?  That's a lot of overhead, and it'd be a
lot worse on a big machine with 8GB of shared_buffers.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Nov 24, 2010 at 12:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Per previous threats, I spent some time tonight running oprofile
> (using the directions Tom Lane was foolish enough to provide me back
> in May). I took testlibpq.c and hacked it up to just connect to the
> server and then disconnect in a tight loop without doing anything
> useful, hoping to measure the overhead of starting up a new
> connection.

Oh, right: attachments.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
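(The hacked testlibpq.c travels as an attachment and isn't reproduced
inline.  A minimal sketch of the sort of loop described, assuming
standard libpq; the conninfo string and iteration count here are
placeholders, not the values actually used:)

    #include <stdio.h>
    #include <stdlib.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        int     i;

        for (i = 0; i < 10000; i++)     /* placeholder iteration count */
        {
            /* placeholder conninfo; the real test presumably hit a local server */
            PGconn *conn = PQconnectdb("dbname=postgres");

            if (PQstatus(conn) != CONNECTION_OK)
            {
                fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
                PQfinish(conn);
                return 1;
            }
            PQfinish(conn);             /* disconnect without running anything */
        }
        return 0;
    }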
On 24.11.2010 07:07, Robert Haas wrote:
> Per previous threats, I spent some time tonight running oprofile
> (using the directions Tom Lane was foolish enough to provide me back
> in May). I took testlibpq.c and hacked it up to just connect to the
> server and then disconnect in a tight loop without doing anything
> useful, hoping to measure the overhead of starting up a new
> connection. Ha, ha, funny about that:
>
> 120899 18.0616 postgres AtProcExit_Buffers
> 56891 8.4992 libc-2.11.2.so memset
> 30987 4.6293 libc-2.11.2.so memcpy
> 26944 4.0253 postgres hash_search_with_hash_value
> 26554 3.9670 postgres AllocSetAlloc
> 20407 3.0487 libc-2.11.2.so _int_malloc
> 17269 2.5799 libc-2.11.2.so fread
> 13005 1.9429 ld-2.11.2.so do_lookup_x
> 11850 1.7703 ld-2.11.2.so _dl_fixup
> 10194 1.5229 libc-2.11.2.so _IO_file_xsgetn
>
> In English: the #1 overhead here is actually something that happens
> when processes EXIT, not when they start. Essentially all the time is
> in two lines:
>
> 56920  6.6006 :     for (i = 0; i < NBuffers; i++)
>               :     {
> 98745 11.4507 :         if (PrivateRefCount[i] != 0)

Oh, that's quite surprising.

> Anything we can do about this? That's a lot of overhead, and it'd be
> a lot worse on a big machine with 8GB of shared_buffers.

Micro-optimizing that search for the non-zero value helps a little bit
(attached).  Reduces the percentage shown by oprofile from about 16% to
12% on my laptop.

For bigger gains, I think you need to somehow make PrivateRefCount
smaller.  Perhaps only use one byte for each buffer instead of int32,
and use some sort of an overflow list for the rare case that a buffer
is pinned more than 255 times.  Or make it a hash table instead of a
simple lookup array.  But whatever you do, you have to be very careful
not to add overhead to PinBuffer/UnPinBuffer; those can already be
quite high in oprofile reports of real applications.

It might be worth experimenting a bit; at the moment PrivateRefCount
takes up 5MB of memory per 1GB of shared_buffers.  Machines with a high
shared_buffers setting have no shortage of memory, but a large array
like that might waste a lot of precious CPU cache.

Now, the other question is if this really matters.  Even if we
eliminate that loop in AtProcExit_Buffers altogether, is
connect/disconnect still so slow that you have to use a connection
pooler if you do that a lot?

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Attachment
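(Heikki's patch is likewise an attachment.  One plausible shape for
such a micro-optimization, purely as a sketch and not the actual patch,
assuming access to the buffer manager's NBuffers and PrivateRefCount
globals: test several refcounts per loop iteration, then re-scan the
offending group one entry at a time.)

    #include <stdint.h>

    extern int      NBuffers;           /* assumed buffer manager globals */
    extern int32_t *PrivateRefCount;    /* per-buffer local pin counts */

    /* Return the index of the first nonzero refcount, or -1 if none. */
    static int
    first_leaked_pin(void)
    {
        int     i = 0;

        /* Test four counters per iteration; the OR is nonzero iff any
         * member of the group is nonzero. */
        for (; i + 4 <= NBuffers; i += 4)
        {
            if (PrivateRefCount[i] | PrivateRefCount[i + 1] |
                PrivateRefCount[i + 2] | PrivateRefCount[i + 3])
                break;
        }
        /* Re-scan the offending group, plus any tail, one entry at a time. */
        for (; i < NBuffers; i++)
        {
            if (PrivateRefCount[i] != 0)
                return i;
        }
        return -1;
    }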
On Wed, Nov 24, 2010 at 2:10 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
>> Anything we can do about this? That's a lot of overhead, and it'd be
>> a lot worse on a big machine with 8GB of shared_buffers.
>
> Micro-optimizing that search for the non-zero value helps a little bit
> (attached). Reduces the percentage shown by oprofile from about 16% to 12%
> on my laptop.
>
> For bigger gains,

The first optimization that occurred to me was "remove the loop
altogether".  I could maybe see needing to do something like this if
we're recovering from an error, but why do we need to do this (except
perhaps to fail an assertion) if we're exiting cleanly?  Even a
session-lifetime buffer-pin leak would be quite disastrous, one would
think.

> Now, the other question is if this really matters. Even if we eliminate
> that loop in AtProcExit_Buffers altogether, is connect/disconnect still
> so slow that you have to use a connection pooler if you do that a lot?

Oh, I'm sure this isn't going to be nearly enough to fix that problem,
but every little bit helps; and if we never do the first optimization,
we'll never get to #30 or wherever it is that it really starts to move
the needle.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Nov 24, 2010 at 2:10 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> Micro-optimizing that search for the non-zero value helps a little bit
>> (attached). Reduces the percentage shown by oprofile from about 16% to 12%
>> on my laptop.

That "micro-optimization" looks to me like your compiler leaves
something to be desired.

> The first optimization that occurred to me was "remove the loop
> altogether".

Or make it execute only in assert-enabled mode, perhaps.

This check had some use back in the bad old days, but the ResourceOwner
mechanism has probably removed a lot of the argument for it.  The
counter-argument might be that failing to remove a buffer pin would be
disastrous; but I can't see that it'd be worse than failing to remove
an LWLock, and we have no belt-and-suspenders-too loop for those.

			regards, tom lane
On Wed, Nov 24, 2010 at 10:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> The first optimization that occurred to me was "remove the loop
>> altogether".
>
> Or make it execute only in assert-enabled mode, perhaps.
>
> This check had some use back in the bad old days, but the ResourceOwner
> mechanism has probably removed a lot of the argument for it.

Yeah, that's what I was thinking - this code would have been a good
backstop when our cleanup mechanisms were not as robust as they seem to
be today.  But making the check execute only in assert-enabled mode
doesn't seem right, since the check actually acts to mask other coding
errors, rather than reveal them.  Maybe we replace the check with one
that only occurs in an Assert-enabled build and just loops through and
does Assert(PrivateRefCount[i] == 0).  I'm not sure exactly where this
gets called in the shutdown sequence, though - is it sensible to
Assert() here?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Nov 24, 2010 at 10:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Or make it execute only in assert-enabled mode, perhaps.

> But making the check execute only in assert-enabled mode
> doesn't seem right, since the check actually acts to mask other coding
> errors, rather than reveal them. Maybe we replace the check with one
> that only occurs in an Assert-enabled build and just loops through and
> does Assert(PrivateRefCount[i] == 0).

Yeah, that would be sensible.  There is precedent for this elsewhere
too; I think there's a similar setup for checking buffer refcounts
during transaction cleanup.

> I'm not sure exactly where this
> gets called in the shutdown sequence, though - is it sensible to
> Assert() here?

Assert is sensible anywhere.

			regards, tom lane
On Wed, Nov 24, 2010 at 11:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Nov 24, 2010 at 10:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Or make it execute only in assert-enabled mode, perhaps.
>
>> But making the check execute only in assert-enabled mode
>> doesn't seem right, since the check actually acts to mask other coding
>> errors, rather than reveal them. Maybe we replace the check with one
>> that only occurs in an Assert-enabled build and just loops through and
>> does Assert(PrivateRefCount[i] == 0).
>
> Yeah, that would be sensible. There is precedent for this elsewhere
> too; I think there's a similar setup for checking buffer refcounts
> during transaction cleanup.
>
>> I'm not sure exactly where this
>> gets called in the shutdown sequence, though - is it sensible to
>> Assert() here?
>
> Assert is sensible anywhere.

OK, patch attached.  Here's what oprofile output looks like with this
applied:

3505   10.4396  libc-2.11.2.so  memset
2051    6.1089  libc-2.11.2.so  memcpy
1686    5.0217  postgres        AllocSetAlloc
1642    4.8907  postgres        hash_search_with_hash_value
1247    3.7142  libc-2.11.2.so  _int_malloc
1096    3.2644  libc-2.11.2.so  fread
 855    2.5466  ld-2.11.2.so    do_lookup_x
 723    2.1535  ld-2.11.2.so    _dl_fixup
 645    1.9211  ld-2.11.2.so    strcmp
 620    1.8467  postgres        MemoryContextAllocZero

Somehow I don't think I'm going to get much further with this without
figuring out how to get oprofile to cough up a call graph.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
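(The patch itself is an attachment.  Roughly the shape under
discussion, as a sketch rather than the committed text, assuming the
backend's usual Assert machinery and the buffer manager globals:)

    static void
    AtProcExit_Buffers(int code, Datum arg)
    {
    #ifdef USE_ASSERT_CHECKING
        if (assert_enabled)
        {
            int     i;

            /* Sanity-check that every local pin count has gone to zero. */
            for (i = 0; i < NBuffers; i++)
                Assert(PrivateRefCount[i] == 0);
        }
    #endif
        /* ... the rest of the exit-time cleanup stays as before ... */
    }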
Robert Haas <robertmhaas@gmail.com> writes:
> OK, patch attached.

Two comments:

1. A comment would help, something like "Assert we released all buffer
pins".

2. AtProcExit_LocalBuffers should be redone the same way, for
consistency (it likely won't make any performance difference).  Note
the comment for AtProcExit_LocalBuffers, too; that probably needs to be
changed along the lines of "If we missed any, and assertions aren't
enabled, we'll fail later in DropRelFileNodeBuffers while trying to
drop the temp rels".

			regards, tom lane
On Wed, Nov 24, 2010 at 1:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> OK, patch attached.
>
> Two comments:

Revised patch attached.

I tried configuring oprofile with --callgraph=10 and then running
oprofile with -c, but it gives kooky looking output I can't interpret.
For example:

      6  42.8571  postgres        record_in
      8  57.1429  postgres        pg_perm_setlocale
  17035   5.7219  libc-2.11.2.so  memcpy
  17035  100.000  libc-2.11.2.so  memcpy [self]

Not that helpful.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Wed, Nov 24, 2010 at 01:20:36PM -0500, Robert Haas wrote:
> On Wed, Nov 24, 2010 at 1:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> >> OK, patch attached.
> >
> > Two comments:
>
> Revised patch attached.
>
> I tried configuring oprofile with --callgraph=10 and then running
> oprofile with -c, but it gives kooky looking output I can't interpret.
> For example:
>
>       6  42.8571  postgres        record_in
>       8  57.1429  postgres        pg_perm_setlocale
>   17035   5.7219  libc-2.11.2.so  memcpy
>   17035  100.000  libc-2.11.2.so  memcpy [self]
>
> Not that helpful. :-(

Have a look at the wiki:
http://wiki.postgresql.org/wiki/Profiling_with_OProfile#Additional_analysis

> Robert Haas

Regards,
Gerhard Heift
On Wed, Nov 24, 2010 at 1:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I tried configuring oprofile with --callgraph=10 and then running
> oprofile with -c, but it gives kooky looking output I can't interpret.

It looks like the trick is to compile with -fno-omit-frame-pointer.
New profiling run:

27563   10.3470  libc-2.11.2.so  memset
15162    5.6917  libc-2.11.2.so  memcpy
13471    5.0569  postgres        hash_search_with_hash_value
13465    5.0547  postgres        AllocSetAlloc
 9513    3.5711  libc-2.11.2.so  _int_malloc
 8729    3.2768  libc-2.11.2.so  fread
 6336    2.3785  ld-2.11.2.so    do_lookup_x
 5788    2.1728  ld-2.11.2.so    _dl_fixup
 4995    1.8751  postgres        MemoryContextAllocZero
 4978    1.8687  ld-2.11.2.so    strcmp

Full results, and call graph, attached.  The first obvious fact is that
most of the memset overhead appears to be coming from InitCatCache.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Wednesday 24 November 2010 19:01:32 Robert Haas wrote:
> Somehow I don't think I'm going to get much further with this without
> figuring out how to get oprofile to cough up a call graph.

I think to do that sensibly you need CFLAGS="-O2 -fno-omit-frame-pointer"...
Gerhard Heift <ml-postgresql-20081012-3518@gheift.de> writes:
> On Wed, Nov 24, 2010 at 01:20:36PM -0500, Robert Haas wrote:
>> I tried configuring oprofile with --callgraph=10 and then running
>> oprofile with -c, but it gives kooky looking output I can't interpret.

> Have a look at the wiki:
> http://wiki.postgresql.org/wiki/Profiling_with_OProfile#Additional_analysis

The critical piece of information is there now, but it wasn't a minute
ago.

			regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
> Revised patch attached.

The asserts in AtProcExit_LocalBuffers are a bit pointless since you
forgot to remove the code that forcibly zeroes LocalRefCount[]...
otherwise +1.

			regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
> Full results, and call graph, attached. The first obvious fact is
> that most of the memset overhead appears to be coming from
> InitCatCache.

AFAICT that must be the palloc0 calls that are zeroing out (mostly) the
hash bucket headers.  I don't see any real way to make that cheaper
other than to cut the initial sizes of the hash tables (and add support
for expanding them later, which is lacking in catcache ATM).  Not
convinced that that creates any net savings --- it might just save some
cycles at startup in exchange for more cycles later, in typical backend
usage.

Making those hashtables expansible wouldn't be a bad thing in itself,
mind you.

			regards, tom lane
On Wed, Nov 24, 2010 at 3:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Full results, and call graph, attached. The first obvious fact is
>> that most of the memset overhead appears to be coming from
>> InitCatCache.
>
> AFAICT that must be the palloc0 calls that are zeroing out (mostly)
> the hash bucket headers. I don't see any real way to make that cheaper
> other than to cut the initial sizes of the hash tables (and add support
> for expanding them later, which is lacking in catcache ATM). Not
> convinced that that creates any net savings --- it might just save
> some cycles at startup in exchange for more cycles later, in typical
> backend usage.
>
> Making those hashtables expansible wouldn't be a bad thing in itself,
> mind you.

The idea I had was to go the other way and say, hey, if these hash
tables can't be expanded anyway, let's put them on the BSS instead of
heap-allocating them.  Any new pages we request from the OS will be
zeroed anyway, but with palloc we then have to re-zero the allocated
block anyway because palloc can return memory that's been used, freed,
and reused.  However, for anything that only needs to be allocated once
and never freed, and whose size can be known at compile time, that's
not an issue.

In fact, it wouldn't be that hard to relax the "known at compile time"
constraint either.  We could just declare:

    char lotsa_zero_bytes[NUM_ZERO_BYTES_WE_NEED];

...and then peel off chunks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
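(A sketch of the "peel off chunks" idea, with invented names and an
invented size; the OS hands BSS pages to the process already zeroed, so
a chunk that has never been written needs no memset.  Nothing here
handles freeing: as the message says, this only suits allocations that
live for the whole process.)

    #include <stddef.h>

    #define NUM_ZERO_BYTES_WE_NEED  (8 * 1024 * 1024)   /* invented size */
    #define CHUNK_ALIGN             sizeof(long)        /* stand-in for MAXALIGN */

    static char   lotsa_zero_bytes[NUM_ZERO_BYTES_WE_NEED]; /* BSS: OS-zeroed */
    static size_t zero_bytes_used = 0;

    /* Hand out a chunk that has never been written, hence still zero. */
    static void *
    peel_off_zero_chunk(size_t size)
    {
        void   *chunk;

        /* round up so successive chunks stay suitably aligned */
        size = (size + CHUNK_ALIGN - 1) & ~(CHUNK_ALIGN - 1);
        if (zero_bytes_used + size > NUM_ZERO_BYTES_WE_NEED)
            return NULL;                /* caller falls back to palloc0 */
        chunk = lotsa_zero_bytes + zero_bytes_used;
        zero_bytes_used += size;
        return chunk;
    }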
On Wednesday 24 November 2010 21:47:32 Robert Haas wrote:
> On Wed, Nov 24, 2010 at 3:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> >> Full results, and call graph, attached. The first obvious fact is
> >> that most of the memset overhead appears to be coming from
> >> InitCatCache.
> >
> > AFAICT that must be the palloc0 calls that are zeroing out (mostly)
> > the hash bucket headers. I don't see any real way to make that
> > cheaper other than to cut the initial sizes of the hash tables (and
> > add support for expanding them later, which is lacking in catcache
> > ATM). Not convinced that that creates any net savings --- it might
> > just save some cycles at startup in exchange for more cycles later,
> > in typical backend usage.
> >
> > Making those hashtables expansible wouldn't be a bad thing in itself,
> > mind you.
>
> The idea I had was to go the other way and say, hey, if these hash
> tables can't be expanded anyway, let's put them on the BSS instead of
> heap-allocating them. Any new pages we request from the OS will be
> zeroed anyway, but with palloc we then have to re-zero the allocated
> block anyway because palloc can return memory that's been used,
> freed, and reused. However, for anything that only needs to be
> allocated once and never freed, and whose size can be known at compile
> time, that's not an issue.
>
> In fact, it wouldn't be that hard to relax the "known at compile time"
> constraint either. We could just declare:
>
> char lotsa_zero_bytes[NUM_ZERO_BYTES_WE_NEED];
>
> ...and then peel off chunks.

Won't this just cause loads of additional pagefaults after fork() when
those pages are used the first time and then a second time when first
written to (to copy it)?

Andres
On Wed, Nov 24, 2010 at 3:53 PM, Andres Freund <andres@anarazel.de> wrote:
> On Wednesday 24 November 2010 21:47:32 Robert Haas wrote:
>> On Wed, Nov 24, 2010 at 3:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> > Robert Haas <robertmhaas@gmail.com> writes:
>> >> Full results, and call graph, attached. The first obvious fact is
>> >> that most of the memset overhead appears to be coming from
>> >> InitCatCache.
>> >
>> > AFAICT that must be the palloc0 calls that are zeroing out (mostly)
>> > the hash bucket headers. I don't see any real way to make that
>> > cheaper other than to cut the initial sizes of the hash tables (and
>> > add support for expanding them later, which is lacking in catcache
>> > ATM). Not convinced that that creates any net savings --- it might
>> > just save some cycles at startup in exchange for more cycles later,
>> > in typical backend usage.
>> >
>> > Making those hashtables expansible wouldn't be a bad thing in itself,
>> > mind you.
>>
>> The idea I had was to go the other way and say, hey, if these hash
>> tables can't be expanded anyway, let's put them on the BSS instead of
>> heap-allocating them. Any new pages we request from the OS will be
>> zeroed anyway, but with palloc we then have to re-zero the allocated
>> block anyway because palloc can return memory that's been used,
>> freed, and reused. However, for anything that only needs to be
>> allocated once and never freed, and whose size can be known at
>> compile time, that's not an issue.
>>
>> In fact, it wouldn't be that hard to relax the "known at compile time"
>> constraint either. We could just declare:
>>
>> char lotsa_zero_bytes[NUM_ZERO_BYTES_WE_NEED];
>>
>> ...and then peel off chunks.
>
> Won't this just cause loads of additional pagefaults after fork() when
> those pages are used the first time and then a second time when first
> written to (to copy it)?

Aren't we incurring those page faults anyway, for whatever memory
palloc is handing out?  The heap is no different from bss; we just move
the pointer with sbrk().

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Nov 24, 2010 at 3:53 PM, Andres Freund <andres@anarazel.de> wrote:
>>> The idea I had was to go the other way and say, hey, if these hash
>>> tables can't be expanded anyway, let's put them on the BSS instead of
>>> heap-allocating them.

>> Won't this just cause loads of additional pagefaults after fork() when
>> those pages are used the first time and then a second time when first
>> written to (to copy it)?

> Aren't we incurring those page faults anyway, for whatever memory
> palloc is handing out? The heap is no different from bss; we just
> move the pointer with sbrk().

I think you're missing the real point, which is that the cost you're
measuring here probably isn't so much memset() as faulting in large
chunks of address space.  Avoiding the explicit memset() likely will
save little in real runtime --- it'll just make sure the initial-touch
costs are more distributed and harder to measure.

But in any case I think this idea is a nonstarter because it gets in
the way of making those hashtables expansible, which we *do* need to do
eventually.

(You might be able to confirm or disprove this theory if you ask
oprofile to count memory access stalls instead of CPU clock cycles...)

			regards, tom lane
On Wednesday 24 November 2010 21:54:53 Robert Haas wrote:
> On Wed, Nov 24, 2010 at 3:53 PM, Andres Freund <andres@anarazel.de> wrote:
> > On Wednesday 24 November 2010 21:47:32 Robert Haas wrote:
> >> On Wed, Nov 24, 2010 at 3:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> > Robert Haas <robertmhaas@gmail.com> writes:
> >> >> Full results, and call graph, attached. The first obvious fact is
> >> >> that most of the memset overhead appears to be coming from
> >> >> InitCatCache.
> >> >
> >> > AFAICT that must be the palloc0 calls that are zeroing out (mostly)
> >> > the hash bucket headers. I don't see any real way to make that
> >> > cheaper other than to cut the initial sizes of the hash tables (and
> >> > add support for expanding them later, which is lacking in catcache
> >> > ATM). Not convinced that that creates any net savings --- it might
> >> > just save some cycles at startup in exchange for more cycles later,
> >> > in typical backend usage.
> >> >
> >> > Making those hashtables expansible wouldn't be a bad thing in
> >> > itself, mind you.
> >>
> >> The idea I had was to go the other way and say, hey, if these hash
> >> tables can't be expanded anyway, let's put them on the BSS instead of
> >> heap-allocating them. Any new pages we request from the OS will be
> >> zeroed anyway, but with palloc we then have to re-zero the allocated
> >> block anyway because palloc can return memory that's been used,
> >> freed, and reused. However, for anything that only needs to be
> >> allocated once and never freed, and whose size can be known at
> >> compile time, that's not an issue.
> >>
> >> In fact, it wouldn't be that hard to relax the "known at compile
> >> time" constraint either. We could just declare:
> >>
> >> char lotsa_zero_bytes[NUM_ZERO_BYTES_WE_NEED];
> >>
> >> ...and then peel off chunks.
> >
> > Won't this just cause loads of additional pagefaults after fork() when
> > those pages are used the first time and then a second time when first
> > written to (to copy it)?
>
> Aren't we incurring those page faults anyway, for whatever memory
> palloc is handing out? The heap is no different from bss; we just
> move the pointer with sbrk().

Yes, but only once.  Also scrubbing a page is faster than copying it...
(and there were patches floating around to do that in advance, not sure
if they got integrated into mainline linux)

Andres
On Nov 24, 2010, at 4:05 PM, Andres Freund <andres@anarazel.de> wrote:
>>> Won't this just cause loads of additional pagefaults after fork() when
>>> those pages are used the first time and then a second time when first
>>> written to (to copy it)?
>>
>> Aren't we incurring those page faults anyway, for whatever memory
>> palloc is handing out? The heap is no different from bss; we just
>> move the pointer with sbrk().
> Yes, but only once. Also scrubbing a page is faster than copying it... (and
> there were patches floating around to do that in advance, not sure if they
> got integrated into mainline linux)

I'm not following - can you elaborate?

...Robert
Robert Haas <robertmhaas@gmail.com> writes:
> On Nov 24, 2010, at 4:05 PM, Andres Freund <andres@anarazel.de> wrote:
>> Yes, but only once. Also scrubbing a page is faster than copying it... (and
>> there were patches floating around to do that in advance, not sure if they
>> got integrated into mainline linux)

> I'm not following - can you elaborate?

I think Andres is saying that bss space isn't optimized during a fork
operation: it'll be propagated to the child as copy-on-write pages.
Dunno if that's true or not, but if it is, it'd be a good reason to
avoid the scheme you're suggesting.

			regards, tom lane
On Wednesday 24 November 2010 22:18:08 Robert Haas wrote:
> On Nov 24, 2010, at 4:05 PM, Andres Freund <andres@anarazel.de> wrote:
> >>> Won't this just cause loads of additional pagefaults after fork() when
> >>> those pages are used the first time and then a second time when first
> >>> written to (to copy it)?
> >>
> >> Aren't we incurring those page faults anyway, for whatever memory
> >> palloc is handing out? The heap is no different from bss; we just
> >> move the pointer with sbrk().
> >
> > Yes, but only once. Also scrubbing a page is faster than copying it...
> > (and there were patches floating around to do that in advance, not sure
> > if they got integrated into mainline linux)
> I'm not following - can you elaborate?

When forking, the memory mappings of the process are copied - the
actual pages are not.  When a page is first accessed, the page fault
handler will set up a mapping to the "old" page and mark it as shared.
When it is then written to, it will fault again and copy the page.

In contrast, if you access a page for the first time after an sbrk (or
mmap, doesn't matter), a new page will get scrubbed and a mapping will
get set up.

Andres
On Wednesday 24 November 2010 22:25:45 Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Nov 24, 2010, at 4:05 PM, Andres Freund <andres@anarazel.de> wrote:
> >> Yes, but only once. Also scrubbing a page is faster than copying it...
> >> (and there were patches floating around to do that in advance, not sure
> >> if they got integrated into mainline linux)
> >
> > I'm not following - can you elaborate?
>
> I think Andres is saying that bss space isn't optimized during a fork
> operation: it'll be propagated to the child as copy-on-write pages.
> Dunno if that's true or not, but if it is, it'd be a good reason to
> avoid the scheme you're suggesting.

Afair nearly all pages are propagated with copy-on-write semantics.

Andres
On Wed, Nov 24, 2010 at 4:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> (You might be able to confirm or disprove this theory if you ask
> oprofile to count memory access stalls instead of CPU clock cycles...)

I don't see an event for that.

# opcontrol --list-events | grep STALL
INSTRUCTION_FETCH_STALL: (counter: all)
DISPATCH_STALLS: (counter: all)
DISPATCH_STALL_FOR_BRANCH_ABORT: (counter: all)
DISPATCH_STALL_FOR_SERIALIZATION: (counter: all)
DISPATCH_STALL_FOR_SEGMENT_LOAD: (counter: all)
DISPATCH_STALL_FOR_REORDER_BUFFER_FULL: (counter: all)
DISPATCH_STALL_FOR_RESERVATION_STATION_FULL: (counter: all)
DISPATCH_STALL_FOR_FPU_FULL: (counter: all)
DISPATCH_STALL_FOR_LS_FULL: (counter: all)
DISPATCH_STALL_WAITING_FOR_ALL_QUIET: (counter: all)
DISPATCH_STALL_FOR_FAR_TRANSFER_OR_RESYNC: (counter: all)

# opcontrol --list-events | grep MEMORY
MEMORY_REQUESTS: (counter: all)
MEMORY_CONTROLLER_PAGE_TABLE_OVERFLOWS: (counter: all)
MEMORY_CONTROLLER_SLOT_MISSED: (counter: all)
MEMORY_CONTROLLER_TURNAROUNDS: (counter: all)
MEMORY_CONTROLLER_BYPASS_COUNTER_SATURATION: (counter: all)
CPU_IO_REQUESTS_TO_MEMORY_IO: (counter: all)
MEMORY_CONTROLLER_REQUESTS: (counter: all)

Ideas?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Nov 24, 2010 at 4:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> (You might be able to confirm or disprove this theory if you ask
>> oprofile to count memory access stalls instead of CPU clock cycles...)

> I don't see an event for that.

You probably want something involving cache misses.  The event names
vary depending on just which CPU you've got.

			regards, tom lane
On Wednesday 24 November 2010 23:03:48 Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Wed, Nov 24, 2010 at 4:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> (You might be able to confirm or disprove this theory if you ask
> >> oprofile to count memory access stalls instead of CPU clock cycles...)
> >
> > I don't see an event for that.
>
> You probably want something involving cache misses. The event names
> vary depending on just which CPU you've got.

Or some BUS OUTSTANDING event.

Andres
On Wed, Nov 24, 2010 at 5:15 PM, Andres Freund <andres@anarazel.de> wrote:
> On Wednesday 24 November 2010 23:03:48 Tom Lane wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>> > On Wed, Nov 24, 2010 at 4:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> >> (You might be able to confirm or disprove this theory if you ask
>> >> oprofile to count memory access stalls instead of CPU clock cycles...)
>> >
>> > I don't see an event for that.
>>
>> You probably want something involving cache misses. The event names
>> vary depending on just which CPU you've got.
> Or some BUS OUTSTANDING event.

I don't see anything for BUS OUTSTANDING.  For CACHE and MISS I have
some options:

# opcontrol --list-events | grep CACHE
DATA_CACHE_ACCESSES: (counter: all)
DATA_CACHE_MISSES: (counter: all)
DATA_CACHE_REFILLS_FROM_L2_OR_NORTHBRIDGE: (counter: all)
DATA_CACHE_REFILLS_FROM_NORTHBRIDGE: (counter: all)
DATA_CACHE_LINES_EVICTED: (counter: all)
LOCKED_INSTRUCTIONS_DCACHE_MISSES: (counter: all)
L2_CACHE_MISS: (counter: all)
L2_CACHE_FILL_WRITEBACK: (counter: all)
INSTRUCTION_CACHE_FETCHES: (counter: all)
INSTRUCTION_CACHE_MISSES: (counter: all)
INSTRUCTION_CACHE_REFILLS_FROM_L2: (counter: all)
INSTRUCTION_CACHE_REFILLS_FROM_SYSTEM: (counter: all)
INSTRUCTION_CACHE_VICTIMS: (counter: all)
INSTRUCTION_CACHE_INVALIDATED: (counter: all)
CACHE_BLOCK_COMMANDS: (counter: all)
READ_REQUEST_L3_CACHE: (counter: all)
L3_CACHE_MISSES: (counter: all)
IBS_FETCH_ICACHE_MISSES: (ext: ibs_fetch)
IBS_FETCH_ICACHE_HITS: (ext: ibs_fetch)
IBS_OP_DATA_CACHE_MISS: (ext: ibs_op)
IBS_OP_NB_LOCAL_CACHE: (ext: ibs_op)
IBS_OP_NB_REMOTE_CACHE: (ext: ibs_op)
IBS_OP_NB_CACHE_MODIFIED: (ext: ibs_op)
IBS_OP_NB_CACHE_OWNED: (ext: ibs_op)
IBS_OP_NB_LOCAL_CACHE_LAT: (ext: ibs_op)
IBS_OP_NB_REMOTE_CACHE_LAT: (ext: ibs_op)

# opcontrol --list-events | grep MISS | grep -v CACHE
L1_DTLB_MISS_AND_L2_DTLB_HIT: (counter: all)
L1_DTLB_AND_L2_DTLB_MISS: (counter: all)
L1_ITLB_MISS_AND_L2_ITLB_HIT: (counter: all)
L1_ITLB_MISS_AND_L2_ITLB_MISS: (counter: all)
MEMORY_CONTROLLER_SLOT_MISSED: (counter: all)
IBS_FETCH_L1_ITLB_MISSES_L2_ITLB_HITS: (ext: ibs_fetch)
IBS_FETCH_L1_ITLB_MISSES_L2_ITLB_MISSES: (ext: ibs_fetch)
IBS_OP_L1_DTLB_MISS_L2_DTLB_HIT: (ext: ibs_op)
IBS_OP_L1_L2_DTLB_MISS: (ext: ibs_op)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> I don't see anything for BUS OUTSTANDING. For CACHE and MISS I have
> some options:

> DATA_CACHE_MISSES: (counter: all)
> L3_CACHE_MISSES: (counter: all)

Those two look promising, though I can't claim to be an expert.

			regards, tom lane
On Wed, Nov 24, 2010 at 5:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I don't see anything for BUS OUTSTANDING. For CACHE and MISS I have
>> some options:
>
>> DATA_CACHE_MISSES: (counter: all)
>> L3_CACHE_MISSES: (counter: all)
>
> Those two look promising, though I can't claim to be an expert.

OK.  Thanksgiving is about to interfere with my access to this machine,
but I'll pick this up next week.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
> >> In fact, it wouldn't be that hard to relax the "known at compile time"
> >> constraint either.  We could just declare:
> >>
> >> char lotsa_zero_bytes[NUM_ZERO_BYTES_WE_NEED];
> >>
> >> ...and then peel off chunks.
> > Won't this just cause loads of additional pagefaults after fork() when
> > those pages are used the first time and then a second time when first
> > written to (to copy it)?
>
> Aren't we incurring those page faults anyway, for whatever memory
> palloc is handing out?  The heap is no different from bss; we just
> move the pointer with sbrk().

Here is perhaps more detail than you wanted, but ...

Basically in a forked process, the text/program is fixed, and the
initialized data and stack are copy-on-write (COW).  Allocating a big
block of zero memory in data is uninitialized data, and the behavior
there differs depending on whether the parent process faulted in those
pages.  If it did, then they are COW, but if it did not, odds are the
OS just gives them to you clean and not shared.  The pages have to be
empty because if it gave you anything else it could be giving you data
from another process.  For details, see
http://docs.hp.com/en/5965-4641/ch01s11.html, "Faulting In a Page of
Stack or Uninitialized Data".

As far as sbrk(), those pages are zero-filled also, again for security
reasons.  You have to clear malloc()'ed memory (or call calloc()) not
because the OS gave you dirty pages but because you might be using
memory that you previously freed.  If you have never freed memory (and
the postmaster/parent has not either), I bet all malloc'ed memory would
be zero-filled.

Not sure that information moves us forward.  If the postmaster cleared
the memory, we would have COW in the child and probably be even slower.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
On Sat, Nov 27, 2010 at 11:18 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Not sure that information moves us forward. If the postmaster cleared
> the memory, we would have COW in the child and probably be even slower.

Well, we can determine the answers to these questions empirically.  I
think some more scrutiny of the code with the points you and Andres and
Tom have raised is probably in order, and probably some more
benchmarking, too.  I haven't had a chance to do that yet, however.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Nov 27, 2010 at 11:18 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> Not sure that information moves us forward.  If the postmaster cleared
>> the memory, we would have COW in the child and probably be even slower.

> Well, we can determine the answers to these questions empirically.

Not really.  Per Bruce's description, a page would become COW the
moment the postmaster touched (either write or read) any variable on
it.  Since we have no control over how the loader lays out static
variables, the actual behavior of a particular build would be pretty
random and subject to unexpected changes caused by seemingly unrelated
edits.

Also, the referenced URL only purports to describe the behavior of
HPUX, which is not exactly a mainstream OS.  I think it requires a
considerable leap of faith to assume that all or even most platforms
work the way this suggests, and not in the dumber fashion Andres
suggested.  Has anybody here actually looked at the relevant Linux or
BSD kernel code?

			regards, tom lane
On Sun, Nov 28, 2010 at 11:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Sat, Nov 27, 2010 at 11:18 PM, Bruce Momjian <bruce@momjian.us> wrote:
>>> Not sure that information moves us forward. If the postmaster cleared
>>> the memory, we would have COW in the child and probably be even slower.
>
>> Well, we can determine the answers to these questions empirically.
>
> Not really. Per Bruce's description, a page would become COW the moment
> the postmaster touched (either write or read) any variable on it. Since
> we have no control over how the loader lays out static variables, the
> actual behavior of a particular build would be pretty random and subject
> to unexpected changes caused by seemingly unrelated edits.

Well, one big character array pretty much has to be laid out
contiguously, and it would be pretty surprising (but not entirely
impossible) to find that the linker randomly sprinkles symbols from
other files in between consecutive definitions in the same source file.

I think the next question to answer is to try to allocate blame for the
memset/memcpy overhead between page faults and the zeroing itself.
That seems like something we can easily measure by writing a test
program that zeroes the same region twice and kicks out timing numbers.
If, as you and Andres are arguing, the actual zeroing is minor, then we
can forget this whole line of discussion and move on to other possible
optimizations.  If that turns out not to be true then we can worry
about how best to avoid the zeroing.  I have to believe that's a
solvable problem; the question is whether there's a benefit.  In a
close race, I don't think we should get bogged down in
micro-optimization here, both because micro-optimizations may not gain
much and because what works well on one platform may not do much at all
on another.

The more general issue here is what to do about our high backend
startup costs.  Beyond trying to recycle backends for new connections,
as I've previously proposed and with all the problems it entails, the
only thing that looks promising here is to try to somehow cut down on
the cost of populating the catcache and relcache, not that I have a
very clear idea how to do that.  This has to be a soluble problem
because other people have solved it.  To some degree we're a victim of
our own flexible and extensible architecture here, but I find it pretty
unsatisfying to just say, OK, well, we're slow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
> On Sat, Nov 27, 2010 at 11:18 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > Not sure that information moves us forward.  If the postmaster cleared
> > the memory, we would have COW in the child and probably be even slower.
>
> Well, we can determine the answers to these questions empirically.  I
> think some more scrutiny of the code with the points you and Andres
> and Tom have raised is probably in order, and probably some more
> benchmarking, too.  I haven't had a chance to do that yet, however.

Basically, my bet is if you allocated a large zero-data variable in the
postmaster but never accessed it from the postmaster, at most you would
copy-on-write (COW) fault in two pages, one at the beginning that is
shared by accessed variables, and one at the end.  The remaining pages
(4k default for x86) would be zero-filled and not COW shared.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
Robert Haas <robertmhaas@gmail.com> writes:
> The more general issue here is what to do about our
> high backend startup costs. Beyond trying to recycle backends for new
> connections, as I've previously proposed and with all the problems it
> entails, the only thing that looks promising here is to try to somehow
> cut down on the cost of populating the catcache and relcache, not that
> I have a very clear idea how to do that.

One comment to make here is that it would be a serious error to focus
on the costs of just starting and stopping a backend; you have to think
about cases where the backend does at least some useful work in
between, and that means actually *populating* those caches (to some
extent) not just initializing them.  Maybe your wording above was
chosen with that in mind, but I think onlookers might easily overlook
the point.

FWIW, today I've been looking into getting rid of the silliness in
build_index_pathkeys whereby it reconstructs pathkey opfamily OIDs from
sortops instead of just using the index opfamilies directly.  It turns
out that once you fix that, there is no need at all for relcache to
cache per-index operator data (the rd_operator arrays) because that's
the only code that uses 'em.  I don't see any particular change in the
runtime of the regression tests from ripping out that part of the
cached data, but it ought to have at least some beneficial effect on
real startup time.

			regards, tom lane
On Sun, Nov 28, 2010 at 3:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> The more general issue here is what to do about our
>> high backend startup costs. Beyond trying to recycle backends for new
>> connections, as I've previously proposed and with all the problems it
>> entails, the only thing that looks promising here is to try to somehow
>> cut down on the cost of populating the catcache and relcache, not that
>> I have a very clear idea how to do that.
>
> One comment to make here is that it would be a serious error to focus on
> the costs of just starting and stopping a backend; you have to think
> about cases where the backend does at least some useful work in between,
> and that means actually *populating* those caches (to some extent) not
> just initializing them. Maybe your wording above was chosen with that
> in mind, but I think onlookers might easily overlook the point.

I did have that in mind, but I agree the point is worth mentioning.
So, for example, it wouldn't gain anything meaningful for us to
postpone catcache initialization until someone executes a query.  It
would improve the synthetic benchmark, but that's it.

> FWIW, today I've been looking into getting rid of the silliness in
> build_index_pathkeys whereby it reconstructs pathkey opfamily OIDs
> from sortops instead of just using the index opfamilies directly.
> It turns out that once you fix that, there is no need at all for
> relcache to cache per-index operator data (the rd_operator arrays)
> because that's the only code that uses 'em. I don't see any particular
> change in the runtime of the regression tests from ripping out that
> part of the cached data, but it ought to have at least some beneficial
> effect on real startup time.

Wow, that's great.  The fact that it simplifies the code is probably
the main point, but obviously whatever cycles we can save during
startup (and ongoing operation) are all to the good.

One possible way to get a real speedup here would be to look for ways
to trim the number of catcaches.  But I'm not too convinced there's
much water to squeeze out of that rock.

After our recent conversation about KNNGIST, it occurred to me to
wonder whether there's really any point in pretending that a user can
usefully add an AM, both due to hard-wired planner knowledge and due to
lack of any sort of extensible XLOG support.  If not, we could
potentially turn pg_am into a hardcoded lookup table rather than a
modifiable catalog, which would also likely be faster; and perhaps
reference AMs elsewhere with characters rather than OIDs.  But even if
this were judged a sensible thing to do I'm not very sure that even a
purpose-built synthetic benchmark would be able to measure the speedup.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> After our recent conversation
> about KNNGIST, it occurred to me to wonder whether there's really any
> point in pretending that a user can usefully add an AM, both due to
> hard-wired planner knowledge and due to lack of any sort of extensible
> XLOG support. If not, we could potentially turn pg_am into a
> hardcoded lookup table rather than a modifiable catalog, which would
> also likely be faster; and perhaps reference AMs elsewhere with
> characters rather than OIDs. But even if this were judged a sensible
> thing to do I'm not very sure that even a purpose-built synthetic
> benchmark would be able to measure the speedup.

Well, the lack of extensible XLOG support is definitely a big handicap
to building a *production* index AM as an add-on.  But it's not such a
handicap for development.  And I don't believe that the planner is
hardwired in any way that doesn't allow new index types.  GIST and GIN
have both been added successfully without kluging the planner.  It does
know a lot more about btree than other index types, but that doesn't
mean you can't add a new index type that doesn't behave like btree;
that's more reflective of where development effort has been spent.

So I would consider the above idea a step backwards, and I doubt it
would save anything meaningful anyway.

			regards, tom lane
On Sun, Nov 28, 2010 at 6:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> After our recent conversation
>> about KNNGIST, it occurred to me to wonder whether there's really any
>> point in pretending that a user can usefully add an AM, both due to
>> hard-wired planner knowledge and due to lack of any sort of extensible
>> XLOG support. If not, we could potentially turn pg_am into a
>> hardcoded lookup table rather than a modifiable catalog, which would
>> also likely be faster; and perhaps reference AMs elsewhere with
>> characters rather than OIDs. But even if this were judged a sensible
>> thing to do I'm not very sure that even a purpose-built synthetic
>> benchmark would be able to measure the speedup.
>
> Well, the lack of extensible XLOG support is definitely a big handicap
> to building a *production* index AM as an add-on. But it's not such a
> handicap for development.

Realistically, it's hard for me to imagine that anyone would go to the
trouble of building it as a loadable module first and then converting
it to part of core later on.  That'd hardly be less work.

> And I don't believe that the planner is
> hardwired in any way that doesn't allow new index types. GIST and GIN
> have both been added successfully without kluging the planner.

We have 9 boolean flags to indicate the capabilities (or lack thereof)
of AMs, and we only have 4 AMs.  It seems altogether plausible to
assume that the next AM we add could require flags 10 and 11.  Heck, I
think KNNGIST is going to require another flag... which will likely
never be set for any AM other than GIST.

> It does
> know a lot more about btree than other index types, but that doesn't
> mean you can't add a new index type that doesn't behave like btree;
> that's more reflective of where development effort has been spent.
>
> So I would consider the above idea a step backwards, and I doubt it
> would save anything meaningful anyway.

That latter point, as far as I'm concerned, is the real nail in the
coffin.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> One possible way to get a real speedup here would be to look for ways
> to trim the number of catcaches.

BTW, it's not going to help to remove catcaches that have a small
initial size, as the pg_am cache certainly does.  If the bucket zeroing
cost is really something to minimize, it's only the caches with the
largest nbuckets counts that are worth considering --- and we certainly
can't remove those without penalty.

			regards, tom lane
BTW, this might be premature to mention pending some tests about
mapping versus zeroing overhead, but it strikes me that there's more
than one way to skin a cat.  I still think the idea of statically
allocated space sucks.  But what if we rearranged things so that
palloc0 doesn't consist of palloc-then-memset, but rather push the
zeroing responsibility down into the allocator?  In particular, I'm
imagining that palloc0 with a sufficiently large space request --- more
than a couple pages --- could somehow arrange to get space that's
guaranteed zero already.  And if the request isn't large, zeroing it
isn't where our problem is anyhow.

The most portable way to do that would be to use calloc instead of
malloc, and hope that libc is smart enough to provide freshly-mapped
space.  It would be good to look and see whether glibc actually does
so, of course.  If not we might end up having to mess with sbrk for
ourselves, and I'm not sure how pleasantly that interacts with malloc.

Another question that would be worth asking here is whether the
hand-baked MemSet macro still outruns memset on modern architectures.
I think it's been quite a few years since that was last tested.

			regards, tom lane
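(A sketch of what pushing the zeroing down into the allocator might
look like, using plain libc calls; the threshold constant is an
invented stand-in for "more than a couple pages", and a real version
would have to route through memory contexts so the chunk could be freed
properly.)

    #include <stdlib.h>
    #include <string.h>

    #define FRESH_PAGES_THRESHOLD   (2 * 4096)  /* invented threshold */

    static void *
    alloc_zeroed(size_t size)
    {
        void   *p;

        if (size >= FRESH_PAGES_THRESHOLD)
        {
            /* Large request: hope libc serves calloc from a fresh mapping,
             * which the kernel guarantees is already zero. */
            return calloc(1, size);
        }
        /* Small request: zeroing is cheap; do it explicitly. */
        p = malloc(size);
        if (p != NULL)
            memset(p, 0, size);
        return p;
    }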
On Mon, Nov 29, 2010 at 12:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The most portable way to do that would be to use calloc instead of malloc,
> and hope that libc is smart enough to provide freshly-mapped space.
> It would be good to look and see whether glibc actually does so,
> of course. If not we might end up having to mess with sbrk for
> ourselves, and I'm not sure how pleasantly that interacts with malloc.

It's *supposed* to interact fine.  The only thing I wonder is that I
think malloc intentionally uses mmap for larger allocations, but I'm
not clear what the advantages are.  Is it because it's a cheaper way to
get zeroed bytes?  Or just so that free has a hope of returning the
allocations to the OS?

> Another question that would be worth asking here is whether the
> hand-baked MemSet macro still outruns memset on modern architectures.
> I think it's been quite a few years since that was last tested.

I know glibc has some sexy memset macros for cases where the size is a
constant.  I'm not sure there's been much of an advance in the general
case though.  This would tend to imply we should consider going the
other direction of having the caller of palloc0 do the zeroing instead.
Or making palloc0 a macro which expands to include calling memset with
the parameter inlined.

-- 
greg
Greg Stark <gsstark@mit.edu> writes:
> On Mon, Nov 29, 2010 at 12:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Another question that would be worth asking here is whether the
>> hand-baked MemSet macro still outruns memset on modern architectures.
>> I think it's been quite a few years since that was last tested.

> I know glibc has some sexy memset macros for cases where the size is a
> constant. I'm not sure there's been much of an advance in the general
> case though. This would tend to imply we should consider going the
> other direction of having the caller of palloc0 do the zeroing
> instead. Or making palloc0 a macro which expands to include calling
> memset with the parameter inlined.

Well, that was exactly the reason why we did it the way we do it.
However, I think it's probably only node allocations where the size is
likely to be constant and hence result in a win.  Perhaps we should
implement makeNode() differently from the general case.

			regards, tom lane
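(A sketch of the makeNode() idea: keep the allocation size a
compile-time constant at each call site so the compiler can expand the
memset inline.  The names and toy node types here are invented;
PostgreSQL's real makeNode()/newNode() differ in detail.)

    #include <stdlib.h>
    #include <string.h>

    typedef enum NodeTag { T_Invalid = 0, T_ExampleNode } NodeTag; /* toy tags */
    typedef struct Node { NodeTag type; } Node;
    typedef struct ExampleNode { NodeTag type; int payload; } ExampleNode;

    /* Because this is inlined, sizeof(type) is a compile-time constant at
     * each call site, letting the compiler turn the memset into a few
     * stores instead of a library call. */
    static inline Node *
    new_node(size_t size, NodeTag tag)
    {
        Node   *n = malloc(size);       /* stand-in for palloc */

        if (n == NULL)
            abort();                    /* palloc would elog(ERROR) instead */
        memset(n, 0, size);
        n->type = tag;
        return n;
    }

    #define make_node(type) ((type *) new_node(sizeof(type), T_##type))

    /* usage: ExampleNode *node = make_node(ExampleNode); */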
On Sun, Nov 28, 2010 at 7:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> One possible way to get a real speedup here would be to look for ways
>> to trim the number of catcaches.
>
> BTW, it's not going to help to remove catcaches that have a small
> initial size, as the pg_am cache certainly does. If the bucket zeroing
> cost is really something to minimize, it's only the caches with the
> largest nbuckets counts that are worth considering --- and we certainly
> can't remove those without penalty.

Yeah, very true.  What's a bit frustrating about the whole thing is
that we spend a lot of time pulling data into the caches that's
basically static and never likely to change anywhere, ever.  I bet the
number of people for whom <(int4, int4) has any non-standard properties
is somewhere between slim and none; and it might well be the case that
formrdesc() is faster than reading the relcache init file, if we didn't
need to worry about deviation from canonical.  This is even more
frustrating in the hypothetical situation where a backend can switch
databases, because we have to blow away all of these cache entries that
are 99.9% likely to be basically identical in the old and new
databases.  The relation descriptors for pg_class and pg_attribute are
examples of things it would be nice to hardwire and never need to
update.  We are really pretty much screwed if there is any meaningful
deviation from what is expected, but relpages, reltuples, and
relfrozenxid - and maybe relacl or reloptions - can legitimately vary
between databases.

Maybe we could speed things up a bit if we got rid of the pg_attribute
entries for the system attributes (except OID).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> Yeah, very true. What's a bit frustrating about the whole thing is
> that we spend a lot of time pulling data into the caches that's
> basically static and never likely to change anywhere, ever.

True.  I wonder if we could do something like the relcache init file
for the catcaches.

> Maybe we could speed things up a bit if we got rid of the pg_attribute
> entries for the system attributes (except OID).

I used to have high hopes for that idea, but the column privileges
patch broke it permanently.

			regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
>> Well, the lack of extensible XLOG support is definitely a big handicap
>> to building a *production* index AM as an add-on. But it's not such a
>> handicap for development.
>
> Realistically, it's hard for me to imagine that anyone would go to the
> trouble of building it as a loadable module first and then converting
> it to part of core later on. That'd hardly be less work.

Well, it depends a lot on external factors, like for example wanting to
use the code before spending the QA time needed for it to land in core.
Two particular examples come to mind, which are tsearch and KNN GiST.

The main problem with integrating into core, AFAIUI, is related to code
maintenance, not at all to the code stability and quality of the addon
itself.  It's just so much easier to develop an external module…

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support
On Sun, Nov 28, 2010 at 11:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Yeah, very true. What's a bit frustrating about the whole thing is
>> that we spend a lot of time pulling data into the caches that's
>> basically static and never likely to change anywhere, ever.
>
> True. I wonder if we could do something like the relcache init file
> for the catcaches.

Maybe.  It's hard to know exactly what to pull in, though, nor is it
clear to me how much it would really save.  You've got to keep the
thing up to date somehow, too.

I finally got around to doing some testing of
page-faults-versus-actual-memory-initialization, using the attached
test program, compiled with warnings, but without optimization.
Typical results on MacOS X:

first run: 297299
second run: 99653

And on Fedora 12 (2.6.32.23-170.fc12.x86_64):

first run: 509309
second run: 114721

I guess the word "run" is misleading (I wrote the program in 5
minutes); it's just zeroing the same chunk twice and measuring the
times.  The difference is presumably the page fault overhead, which
implies that faulting is two-thirds of the overhead on MacOS X and
three-quarters of the overhead on Linux.  This makes me pretty
pessimistic about the chances of a meaningful speedup here.

>> Maybe we could speed things up a bit if we got rid of the pg_attribute
>> entries for the system attributes (except OID).
>
> I used to have high hopes for that idea, but the column privileges
> patch broke it permanently.

http://archives.postgresql.org/pgsql-hackers/2010-07/msg00151.php

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
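(The attachment isn't preserved in this archive.  A minimal
reconstruction of the kind of test described, timing two memsets over
the same freshly malloc'd region; the 512MB region size is an
assumption:)

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>

    #define CHUNK_SIZE  (512 * 1024 * 1024)

    static long
    elapsed_usec(struct timeval *t0, struct timeval *t1)
    {
        return (t1->tv_sec - t0->tv_sec) * 1000000L
             + (t1->tv_usec - t0->tv_usec);
    }

    int
    main(void)
    {
        char           *buf = malloc(CHUNK_SIZE);
        struct timeval  t0, t1;

        if (buf == NULL)
            return 1;

        gettimeofday(&t0, NULL);
        memset(buf, 0, CHUNK_SIZE);     /* faults pages in, then zeroes them */
        gettimeofday(&t1, NULL);
        printf("first run: %ld\n", elapsed_usec(&t0, &t1));

        gettimeofday(&t0, NULL);
        memset(buf, 0, CHUNK_SIZE);     /* pages already resident: pure zeroing */
        gettimeofday(&t1, NULL);
        printf("second run: %ld\n", elapsed_usec(&t0, &t1));

        return 0;
    }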
On Monday 29 November 2010 17:57:51 Robert Haas wrote:
> On Sun, Nov 28, 2010 at 11:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> >> Yeah, very true. What's a bit frustrating about the whole thing is
> >> that we spend a lot of time pulling data into the caches that's
> >> basically static and never likely to change anywhere, ever.
> >
> > True. I wonder if we could do something like the relcache init file
> > for the catcaches.
>
> Maybe. It's hard to know exactly what to pull in, though, nor is it
> clear to me how much it would really save. You've got to keep the
> thing up to date somehow, too.
>
> I finally got around to doing some testing of
> page-faults-versus-actual-memory-initialization, using the attached
> test program, compiled with warnings, but without optimization.
> Typical results on MacOS X:
>
> first run: 297299
> second run: 99653
>
> And on Fedora 12 (2.6.32.23-170.fc12.x86_64):
>
> first run: 509309
> second run: 114721

Hm.  A quick test shows that it's quite a bit faster if you allocate
memory with:

    size_t s = 512 * 1024 * 1024;
    char *bss = mmap(0, s, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS, -1, 0);

Andres
On Mon, Nov 29, 2010 at 12:24 PM, Andres Freund <andres@anarazel.de> wrote: > Hm. A quick test shows that its quite a bit faster if you allocate memory > with: > size_t s = 512*1024*1024; > char *bss = mmap(0, s, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_POPULATE| > MAP_ANONYMOUS, -1, 0); Numbers? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Monday 29 November 2010 18:34:02 Robert Haas wrote: > On Mon, Nov 29, 2010 at 12:24 PM, Andres Freund <andres@anarazel.de> wrote: > > Hm. A quick test shows that it's quite a bit faster if you allocate memory > > with: > > size_t s = 512*1024*1024; > > char *bss = mmap(0, s, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_POPULATE| > > MAP_ANONYMOUS, -1, 0); > > Numbers?

malloc alloc: 43
malloc memset1: 438763
malloc memset2: 98764
total: 537570

mmap alloc: 296065
mmap memset1: 99203
mmap memset2: 100608
total: 495876

But you don't actually need the memset1 in the mmap case as MAP_ANONYMOUS memory is already zeroed. We could actually use that knowledge even without MAP_POPULATE if we somehow keep track of whether an allocated memory region is still zeroed. Taking that into account it's:

malloc alloc: 47
malloc memset1: 437819
malloc memset2: 98317
total: 536183

mmap alloc: 292904
mmap memset1: 1
mmap memset2: 99284
total: 392189

I am somewhat reluctant to believe that's the way to go. Andres
On Mon, Nov 29, 2010 at 9:24 AM, Andres Freund <andres@anarazel.de> wrote: > On Monday 29 November 2010 17:57:51 Robert Haas wrote: >> On Sun, Nov 28, 2010 at 11:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> > Robert Haas <robertmhaas@gmail.com> writes: >> >> Yeah, very true. What's a bit frustrating about the whole thing is >> >> that we spend a lot of time pulling data into the caches that's >> >> basically static and never likely to change anywhere, ever. >> > >> > True. I wonder if we could do something like the relcache init file >> > for the catcaches. >> >> Maybe. It's hard to know exactly what to pull in, though, nor is it >> clear to me how much it would really save. You've got to keep the >> thing up to date somehow, too. >> >> I finally got around to doing some testing of >> page-faults-versus-actual-memory-initialization, using the attached >> test program, compiled with warnings, but without optimization. >> Typical results on MacOS X: >> >> first run: 297299 >> second run: 99653 >> >> And on Fedora 12 (2.6.32.23-170.fc12.x86_64): >> >> first run: 509309 >> second run: 114721 > Hm. A quick test shows that it's quite a bit faster if you allocate memory > with: > size_t s = 512*1024*1024; > char *bss = mmap(0, s, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_POPULATE| > MAP_ANONYMOUS, -1, 0); Could you post the program? Are you sure you haven't just moved the page-fault time to a part of the code where it still exists, but just isn't being captured and reported? Cheers, Jeff
Robert Haas <robertmhaas@gmail.com> writes: > I guess the word "run" is misleading (I wrote the program in 5 > minutes); it's just zeroing the same chunk twice and measuring the > times. The difference is presumably the page fault overhead, which > implies that faulting is two-thirds of the overhead on MacOS X and > three-quarters of the overhead on Linux. Ah, cute solution to the measurement problem. I replicated the experiment just as a cross-check:

Fedora 13 on x86_64 (recent Nehalem):
first run: 346767
second run: 103143

Darwin on x86_64 (not-so-recent Penryn):
first run: 341289
second run: 64535

HPUX on HPPA:
first run: 2191136
second run: 1199879

(On the last two machines I had to cut the array size to 256MB to avoid swapping.) All builds with "gcc -O2". > This makes me pretty > pessimistic about the chances of a meaningful speedup here. Yeah, this is confirmation that what you are seeing in the original test is mostly about faulting pages in, not about the zeroing. I think it would still be interesting to revisit the micro-optimization of MemSet(), but it doesn't look like massive restructuring to avoid it altogether is going to be worthwhile. regards, tom lane
Jeff Janes <jeff.janes@gmail.com> writes: > Are you sure you haven't just moved the page-fault time to a part of > the code where it still exists, but just isn't being captured and > reported? I'm a bit suspicious about that too. Another thing to keep in mind is that Robert's original program doesn't guarantee that the char array is maxaligned; though reasonable implementations of memset should be able to use the same inner loop anyway for most of the array. I did some experimentation here and couldn't find any real difference in runtime between the original program and substituting a malloc() call for the static array allocation. Rolling in calloc in place of malloc/memset made no particular difference either, which says that Fedora 13's glibc does not have any optimization for that case as I'd hoped. regards, tom lane
On Mon, Nov 29, 2010 at 12:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > (On the last two machines I had to cut the array size to 256MB to avoid > swapping.) You weren't kidding about that "not so recent" part. :-) >> This makes me pretty >> pessimistic about the chances of a meaningful speedup here. > > Yeah, this is confirmation that what you are seeing in the original test > is mostly about faulting pages in, not about the zeroing. I think it > would still be interesting to revisit the micro-optimization of > MemSet(), but it doesn't look like massive restructuring to avoid it > altogether is going to be worthwhile. Yep. I think that what we've established here is that starting new processes all the time is just plain expensive, and we're going to have to start fewer of them if we want to make a meaningful improvement. My impression is that the process startup overhead is even higher on Windows, although I am not now nor have I ever been a Windows programmer. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Sat, Nov 27, 2010 at 11:18 PM, Bruce Momjian <bruce@momjian.us> wrote: > >> Not sure that information moves us forward. If the postmaster cleared > >> the memory, we would have COW in the child and probably be even slower. > > > Well, we can determine the answers to these questions empirically. > > Not really. Per Bruce's description, a page would become COW the moment > the postmaster touched (either write or read) any variable on it. Since > we have no control over how the loader lays out static variables, the > actual behavior of a particular build would be pretty random and subject > to unexpected changes caused by seemingly unrelated edits. I believe all linkers will put initialized data ("data" segment) before uninitialized data ("bss" segment): http://en.wikipedia.org/wiki/Data_segment The only question is whether the linker has data and bss sharing the same VM page (4k), or whether a new VM page is used when starting the bss segment. > Also, the referenced URL only purports to describe the behavior of > HPUX, which is not exactly a mainstream OS. I think it requires a > considerable leap of faith to assume that all or even most platforms > work the way this suggests, and not in the dumber fashion Andres > suggested. Has anybody here actually looked at the relevant Linux > or BSD kernel code? I have years ago, but not recently. You can see the sections on Linux via objdump:

$ objdump --headers /bin/ls

/bin/ls:     file format elf32-i386

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
...
 24 .data         0000012c  080611a0  080611a0  000191a0  2**5
                  CONTENTS, ALLOC, LOAD, DATA
 25 .bss          00000c40  080612e0  080612e0  000192cc  2**5
                  ALLOC

Based on this output, a new 4k page is not started for the 'bss' segment. It basically uses 32-byte alignment. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Robert Haas wrote: > In a close race, I don't think we should get bogged down in > micro-optimization here, both because micro-optimizations may not gain > much and because what works well on one platform may not do much at > all on another. The more general issue here is what to do about our > high backend startup costs. Beyond trying to recycle backends for new > connections, as I've previously proposed and with all the problems it > entails, the only thing that looks promising here is to try to somehow > cut down on the cost of populating the catcache and relcache, not that > I have a very clear idea how to do that. This has to be a soluble > problem because other people have solved it. To some degree we're a > victim of our own flexible and extensible architecture here, but I > find it pretty unsatisfying to just say, OK, well, we're slow. Combining your last two sentences, I am not sure anyone with the flexibility we have has solved the "cache populating" problem. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Tom Lane wrote: > BTW, this might be premature to mention pending some tests about mapping > versus zeroing overhead, but it strikes me that there's more than one > way to skin a cat. I still think the idea of statically allocated space > sucks. But what if we rearranged things so that palloc0 doesn't consist > of palloc-then-memset, but rather push the zeroing responsibility down > into the allocator? In particular, I'm imagining that palloc0 with a > sufficiently large space request --- more than a couple pages --- could > somehow arrange to get space that's guaranteed zero already. And if the > request isn't large, zeroing it isn't where our problem is anyhow. > The most portable way to do that would be to use calloc instead of malloc, > and hope that libc is smart enough to provide freshly-mapped space. > It would be good to look and see whether glibc actually does so, > of course. If not we might end up having to mess with sbrk for > ourselves, and I'm not sure how pleasantly that interacts with malloc. Yes, I was going to suggest trying calloc(), either because we can get already-zeroed sbrk() memory, or because libc uses assembly language for zeroing memory, as some good libc's do. I know most kernels also use assembly for zeroing memory. > Another question that would be worth asking here is whether the > hand-baked MemSet macro still outruns memset on modern architectures. > I think it's been quite a few years since that was last tested. Yes, MemSet was found to be faster than calling a C function, but new testing is certainly warranted. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
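For anyone who hasn't looked at it recently, the core of the hand-baked MemSet is a word-at-a-time zeroing loop used when the target is maxaligned and the request is small; roughly in the spirit of this sketch (a simplification, not the actual macro from c.h, which also enforces the alignment check and a length cutoff before falling back to memset()):

#include <stddef.h>

/* Zero "len" bytes starting at "start", one long at a time. Assumes,
 * as the real macro checks before use, that start is maxaligned and
 * len is a multiple of sizeof(long). */
static void memset_words(void *start, size_t len)
{
    long *p = (long *) start;
    long *stop = (long *) ((char *) start + len);

    while (p < stop)
        *p++ = 0;
}

The question Tom raises is whether this still beats a memset() call now that compilers inline and specialize memset so aggressively.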
Robert Haas wrote: > On Sun, Nov 28, 2010 at 7:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > >> One possible way to get a real speedup here would be to look for ways > >> to trim the number of catcaches. > > > > BTW, it's not going to help to remove catcaches that have a small > > initial size, as the pg_am cache certainly does. If the bucket zeroing > > cost is really something to minimize, it's only the caches with the > > largest nbuckets counts that are worth considering --- and we certainly > > can't remove those without penalty. > > Yeah, very true. What's a bit frustrating about the whole thing is > that we spend a lot of time pulling data into the caches that's > basically static and never likely to change anywhere, ever. I bet the > number of people for whom <(int4, int4) has any non-standard > properties is somewhere between slim and none; and it might well be > the case that formrdesc() is faster than reading the relcache init > file, if we didn't need to worry about deviation from canonical. This > is even more frustrating in the hypothetical situation where a backend > can switch databases, because we have to blow away all of these cache > entries that are 99.9% likely to be basically identical in the old and > new databases. It is very tempting to look at optimizations here, but I am worried we might head down the flat-files solution that caused continual problems in the past. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Greg Stark wrote: > On Mon, Nov 29, 2010 at 12:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > The most portable way to do that would be to use calloc instead of malloc, > > and hope that libc is smart enough to provide freshly-mapped space. > > It would be good to look and see whether glibc actually does so, > > of course. If not we might end up having to mess with sbrk for > > ourselves, and I'm not sure how pleasantly that interacts with malloc. > > It's *supposed* to interact fine. The only thing I wonder is that I > think malloc intentionally uses mmap for larger allocations but I'm > not clear what the advantages are. Is it because it's a cheaper way to > get zeroed bytes? Or just so that free has a hope of returning the > allocations to the OS? Using mmap() so you can return large allocations to the OS is a neat trick, certainly. I am not sure who implements that. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Monday 29 November 2010 19:10:07 Tom Lane wrote: > Jeff Janes <jeff.janes@gmail.com> writes: > > Are you sure you haven't just moved the page-fault time to a part of > > the code where it still exists, but just isn't being captured and > > reported? > > I'm a bit suspicious about that too. Another thing to keep in mind > is that Robert's original program doesn't guarantee that the char > array is maxaligned; though reasonable implementations of memset > should be able to use the same inner loop anyway for most of the > array. Yes, I measured the time including mmap itself. I don't find it surprising that it's taking measurably shorter, as it can just put up the mappings without explicitly faulting for each and every page. The benefit is too small to worry about, though, so ... The answer to Robert includes the timings + test program. Andres
On Mon, 2010-11-29 at 13:10 -0500, Tom Lane wrote: > Rolling in calloc in place of > malloc/memset made no particular difference either, which says that > Fedora 13's glibc does not have any optimization for that case as I'd > hoped. glibc's calloc is either mmap of /dev/zero or malloc followed by memset.
Peter Eisentraut <peter_e@gmx.net> writes: > On Mon, 2010-11-29 at 13:10 -0500, Tom Lane wrote: >> Rolling in calloc in place of >> malloc/memset made no particular difference either, which says that >> Fedora 13's glibc does not have any optimization for that case as I'd >> hoped. > glibc's calloc is either mmap of /dev/zero or malloc followed by memset. Hmm. I would have expected to see a difference then. Do you know what conditions are needed to cause the mmap to be used? regards, tom lane
On 11/28/10, Robert Haas <robertmhaas@gmail.com> wrote: > > In a close race, I don't think we should get bogged down in > micro-optimization here, both because micro-optimizations may not gain > much and because what works well on one platform may not do much at > all on another. The more general issue here is what to do about our > high backend startup costs. Beyond trying to recycle backends for new > connections, as I've previously proposed and with all the problems it > entails,

Is there a particular discussion of that matter you could point me to?

> the only thing that looks promising here is to try to somehow > cut down on the cost of populating the catcache and relcache, not that > I have a very clear idea how to do that. This has to be a soluble > problem because other people have solved it.

Oracle's backend start up time seems to be way higher than PG's. Their main solution is fundamentally a built-in connection pooler with some bells and whistles. I'm not sure which "other people" you had in mind--Oracle is generally the one that pops to my mind.

> To some degree we're a > victim of our own flexible and extensible architecture here, but I > find it pretty unsatisfying to just say, OK, well, we're slow.

What about "well OK, we have PGbouncer"? Are there fixable shortcomings that it has which could make the issue less of an issue?

Cheers, Jeff
On Tue, 2010-11-30 at 15:49 -0500, Tom Lane wrote: > Peter Eisentraut <peter_e@gmx.net> writes: > > On Mon, 2010-11-29 at 13:10 -0500, Tom Lane wrote: > >> Rolling in calloc in place of > >> malloc/memset made no particular difference either, which says that > >> Fedora 13's glibc does not have any optimization for that case as I'd > >> hoped. > > > glibc's calloc is either mmap of /dev/zero or malloc followed by memset. > > Hmm. I would have expected to see a difference then. Do you know what > conditions are needed to cause the mmap to be used? Check out the mallopt(3) man page. It contains a few tunable malloc options that may be useful for your investigation.
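In particular, glibc's mmap threshold can be lowered so that even modest requests are served by fresh, already-zeroed mapped pages; something like this quick experiment (M_MMAP_THRESHOLD is a documented glibc mallopt knob; the 64 kB value here is arbitrary):

#include <malloc.h>
#include <stdlib.h>

int main(void)
{
    char *p;

    /* Ask glibc to satisfy any request of 64 kB or more via mmap(),
       which lets calloc() skip the explicit memset for such blocks. */
    mallopt(M_MMAP_THRESHOLD, 64 * 1024);

    p = calloc(512 * 1024 * 1024, 1);   /* should come back freshly mapped, already zero */
    if (p != NULL)
        free(p);                        /* mmap'ed chunks are unmapped, returned to the OS */
    return 0;
}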
On Tue, Nov 30, 2010 at 11:32 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > On 11/28/10, Robert Haas <robertmhaas@gmail.com> wrote: >> >> In a close race, I don't think we should get bogged down in >> micro-optimization here, both because micro-optimizations may not gain >> much and because what works well on one platform may not do much at >> all on another. The more general issue here is what to do about our >> high backend startup costs. Beyond trying to recycle backends for new >> connections, as I've previously proposed and with all the problems it >> entails, > > Is there a particular discussion of that matter you could point me to? > >> the only thing that looks promising here is to try to somehow >> cut down on the cost of populating the catcache and relcache, not that >> I have a very clear idea how to do that. This has to be a soluble >> problem because other people have solved it. > > Oracle's backend start up time seems to be way higher than PG's. > Their main solution is fundamentally a built-in connection pooler > with some bells and whistles. I'm not sure which "other people" you > had in mind--Oracle is generally the one that pops to my mind.

Interesting. How about MySQL and SQL Server?

>> To some degree we're a >> victim of our own flexible and extensible architecture here, but I >> find it pretty unsatisfying to just say, OK, well, we're slow. > > What about "well OK, we have PGbouncer"? Are there fixable > shortcomings that it has which could make the issue less of an issue?

We do have pgbouncer, and pgpool-II, and that's a good thing. But it also requires proxying every interaction with the database through an intermediate piece of software, which is not free. An in-core solution ought to be able to arrange for each new connection to be directly attached to an existing backend, using file-descriptor passing. Tom has previously complained that this isn't portable, but a little research suggests that it is supported on at least Linux, Mac OS X, FreeBSD, OpenBSD, Solaris, and Windows, so in practice the percentage of our user base who could benefit seems like it would likely be very high. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
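The descriptor passing in question is the SCM_RIGHTS ancillary-data mechanism of Unix-domain sockets. A minimal sketch of the sending side (error handling trimmed; the helper name is mine, not anything in the tree):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Hand the open descriptor "fd" to the process at the other end of the
 * Unix-domain socket "sock"; the kernel installs a duplicate of it in
 * the receiver when it calls recvmsg(). */
static int send_fd(int sock, int fd)
{
    struct msghdr msg;
    struct iovec iov;
    struct cmsghdr *cmsg;
    union
    {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;           /* forces proper alignment */
    } u;
    char dummy = 'F';                   /* must transfer at least one byte */

    memset(&msg, 0, sizeof(msg));
    iov.iov_base = &dummy;
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

The receiver does the mirror-image recvmsg() and finds the duplicated descriptor in the control message; on Windows the rough equivalent would presumably be WSADuplicateSocket() rather than this interface.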
On Wednesday 01 December 2010 15:20:32 Robert Haas wrote: > On Tue, Nov 30, 2010 at 11:32 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > > On 11/28/10, Robert Haas <robertmhaas@gmail.com> wrote: > >> To some degree we're a > >> victim of our own flexible and extensible architecture here, but I > >> find it pretty unsatisfying to just say, OK, well, we're slow. > > > > What about "well OK, we have PGbouncer"? Are there fixable > > short-comings that it has which could make the issue less of an issue? > > We do have pgbouncer, and pgpool-II, and that's a good thing. But it > also requires proxying every interaction with the database through an > intermediate piece of software, which is not free. An in-core > solution ought to be able to arrange for each new connection to be > directly attached to an existing backend, using file-descriptor > passing. Tom has previously complained that this isn't portable, but > a little research suggests that it is supported on at least Linux, Mac > OS X, FreeBSD, OpenBSD, Solaris, and Windows, so in practice the > percentage of our user base who could benefit seems like it would > likely be very high. HPUX and AIX allow fd transfer as well. I still don't see what even remotely relevant platform would be a problem. Andres
Robert Haas <robertmhaas@gmail.com> wrote: > Jeff Janes <jeff.janes@gmail.com> wrote: >> Oracle's backend start up time seems to be way higher than PG's. > Interesting. How about MySQL and SQL Server? My recollection of Sybase ASE is that establishing a connection doesn't start a backend or even a thread. It establishes a network connection and associates network queues and a connection context structure with it. "Engine" threads with CPU affinity (and a few miscellaneous "worker" threads, too, if I remember right) do all the work in a queue-based fashion. Last I worked with MS SQL Server it was based on the Sybase code and therefore worked the same way. I know they've made a lot of changes in the last five years (including switching to MVCC and adding snapshot isolation in addition to the already-existing serializable isolation), so I don't know whether connection startup cost has changed along the way. -Kevin
On Wed, Dec 1, 2010 at 6:20 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Nov 30, 2010 at 11:32 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >> On 11/28/10, Robert Haas <robertmhaas@gmail.com> wrote: >>> >>> In a close race, I don't think we should get bogged down in >>> micro-optimization here, both because micro-optimizations may not gain >>> much and because what works well on one platform may not do much at >>> all on another. The more general issue here is what to do about our >>> high backend startup costs. Beyond trying to recycle backends for new >>> connections, as I've previously proposed and with all the problems it >>> entails, >> >> Is there a particular discussion of that matter you could point me to? >> >>> the only thing that looks promising here is to try to somehow >>> cut down on the cost of populating the catcache and relcache, not that >>> I have a very clear idea how to do that. This has to be a soluble >>> problem because other people have solved it. >> >> Oracle's backend start up time seems to be way higher than PG's. >> Their main solution is fundamentally a built-in connection pooler >> with some bells and whistles. I'm not sure which "other people" you >> had in mind--Oracle is generally the one that pops to my mind. > > Interesting. How about MySQL and SQL Server?

I don't have experience with MS SQL Server, and don't know how it performs on that front. I haven't really considered MySQL to be a "real" RDBMS, more of just an indexing system, although I guess it is steadily becoming more featureful. It is indisputably faster at making connections than PG, but still much slower than a connection pooler.

>>> To some degree we're a >>> victim of our own flexible and extensible architecture here, but I >>> find it pretty unsatisfying to just say, OK, well, we're slow. >> >> What about "well OK, we have PGbouncer"? Are there fixable >> shortcomings that it has which could make the issue less of an issue? > > We do have pgbouncer, and pgpool-II, and that's a good thing. But it > also requires proxying every interaction with the database through an > intermediate piece of software, which is not free.

True, a simple in-memory benchmark with pgbench -S -c1 showed 10,000 tps connecting straight, and 7000 tps through pgbouncer. But if people want to make and break 100s of connections per second, they must not be doing very many queries per connection, so I don't know how relevant that per-query slow-down is.

> An in-core > solution ought to be able to arrange for each new connection to be > directly attached to an existing backend, using file-descriptor > passing.

But who would be doing the passing? For the postmaster to be doing that would probably go against the minimalist design. It would have to keep track of which backend is available, and which db and user it is primed for. Perhaps a feature could be added to the backend to allow it to get passed an FD from pgbouncer or pgpool-II and then hand control back to the pooler upon "close" of the connection, as they already have the infrastructure to keep pools around while the postmaster does not. Are pgbouncer and pgpool close enough to "core" to make such intimate collaboration with the backend OK?

Cheers, Jeff
On 12/01/2010 05:32 AM, Jeff Janes wrote: > On 11/28/10, Robert Haas <robertmhaas@gmail.com> wrote: >> >> In a close race, I don't think we should get bogged down in >> micro-optimization here, both because micro-optimizations may not gain >> much and because what works well on one platform may not do much at >> all on another. The more general issue here is what to do about our >> high backend startup costs. Beyond trying to recycle backends for new >> connections, as I've previously proposed and with all the problems it >> entails, > > Is there a particular discussion of that matter you could point me to? > >> the only thing that looks promising here is to try to somehow >> cut down on the cost of populating the catcache and relcache, not that >> I have a very clear idea how to do that. This has to be a soluble >> problem because other people have solved it. > > Oracle's backend start up time seems to be way higher than PG's. > Their main solution is fundamentally a built-in connection pooler > with some bells and whistles. I'm not sure which "other people" you > had in mind--Oracle is generally the one that pops to my mind. > >> To some degree we're a >> victim of our own flexible and extensible architecture here, but I >> find it pretty unsatisfying to just say, OK, well, we're slow. > > What about "well OK, we have PGbouncer"? Are there fixable > shortcomings that it has which could make the issue less of an issue?

Well, I would very much like to see an integrated pooler in postgresql - pgbouncer is a very nice piece of software (and might even be a base for an integrated bouncer), but not being closely tied to the backend you are losing a lot. One of the more obvious examples is that now that we have no flatfile copy of pg_authid you have to use cruel hacks like: http://www.depesz.com/index.php/2010/12/04/auto-refreshing-password-file-for-pgbouncer/ to get "automatic" management of roles. There are some other drawbacks as well:

* no coordination of restarts/configuration changes between the cluster and the pooler
* you have two pieces of config files to configure your pooling settings (having all that available say in a catalog in pg would be awesome)
* you lose all of the advanced authentication features of pg (because all connections need to go through the pooler) and also ip-based stuff
* no SSL support (in the case of pgbouncer)
* complexity in resetting backend state (we added some support for that through explicit SQL-level commands in recent releases but it still is a hazard)

Stefan
> * no coordination of restarts/configuration changes between the cluster and the pooler
> * you have two pieces of config files to configure your pooling settings (having all that available say in a catalog in pg would be awesome)
> * you lose all of the advanced authentication features of pg (because all connections need to go through the pooler) and also ip-based stuff
> * no SSL support (in the case of pgbouncer)
> * complexity in resetting backend state (we added some support for that through explicit SQL-level commands in recent releases but it still is a hazard)

More:

* pooler logs to a separate file, for which there are (currently) no analysis tools
* pooling is incompatible with the use of ROLES for data security

The last is a major issue, and not one I think we can easily resolve. MySQL has a pooling-friendly user system, because when you connect to MySQL you basically always connect as the superuser and on connection it switches you to your chosen login role. This, per Rob Wultsch, is one of the things at the heart of allowing MySQL to support 100,000 low-frequency users per cheap hosting system. As you might imagine, this behavior is also the source of a lot of MySQL's security bugs. I don't see how we could imitate it without getting the bugs as well. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Sun, Dec 5, 2010 at 11:59 AM, Josh Berkus <josh@agliodbs.com> wrote: > >> * no coordination of restarts/configuration changes between the cluster >> and the pooler >> * you have two pieces of config files to configure your pooling settings >> (having all that available say in a catalog in pg would be awesome) >> * you lose all of the advanced authentication features of pg (because >> all connections need to go through the pooler) and also ip-based stuff >> * no SSL support (in the case of pgbouncer) >> * complexity in resetting backend state (we added some support for that >> through explicit SQL-level commands in recent releases but it still is a >> hazard) > > More: > > * pooler logs to a separate file, for which there are (currently) no analysis > tools > * pooling is incompatible with the use of ROLES for data security > > The last is a major issue, and not one I think we can easily resolve. MySQL > has a pooling-friendly user system, because when you connect to MySQL you > basically always connect as the superuser and on connection it switches you > to your chosen login role. This, per Rob Wultsch, is one of the things at > the heart of allowing MySQL to support 100,000 low-frequency users per cheap > hosting system. > > As you might imagine, this behavior is also the source of a lot of MySQL's > security bugs. I don't see how we could imitate it without getting the bugs > as well. > > I think you have read a bit more into what I have said than is correct. MySQL can deal with thousands of users and separate schemas on commodity hardware. There are many design decisions (some questionable) that have made MySQL much better in a shared hosting environment than pg and I don't know where the grants system falls into that. MySQL does not have that many security problems because of how grants are stored. Most MySQL security issues are DOS sort of stuff based on an authenticated user being able to cause a crash. The decoupled backend storage and a less than awesome parser shared most of the blame for these issues. One thing I would suggest that the PG community keeps in mind while talking about built in connection process caching, is that it is very nice feature for memory leaks caused by a connection to not exist for and continue growing forever. NOTE: 100k is not a number that I would put much stock in. I don't recall ever mentioning that number and it is not a number that would be truthful for me to throw out. -- Rob Wultsch wultsch@gmail.com
On Sun, Dec 5, 2010 at 12:45 PM, Rob Wultsch <wultsch@gmail.com> wrote: > One thing I would suggest that the PG community keeps in mind while > talking about built in connection process caching, is that it is very > nice feature for memory leaks caused by a connection to not exist for > and continue growing forever. s/not exist for/not exist/ I have had issues with very slow leaks in MySQL building up over months. It really sucks to have to go to management to ask for downtime because of a slow memory leak. -- Rob Wultsch wultsch@gmail.com
On Sun, Dec 5, 2010 at 3:17 PM, Rob Wultsch <wultsch@gmail.com> wrote: > On Sun, Dec 5, 2010 at 12:45 PM, Rob Wultsch <wultsch@gmail.com> wrote: >> One thing I would suggest that the PG community keeps in mind while >> talking about built in connection process caching, is that it is very >> nice feature for memory leaks caused by a connection to not exist for >> and continue growing forever. > > s/not exist for/not exist/ > > I have had issues with very slow leaks in MySQL building up over > months. It really sucks to have to go to management to ask for > downtime because of a slow memory leak. Apache has a very simple and effective solution to this problem - they have a configuration option controlling the number of connections a child process handles before it dies and a new one is spawned. I've found that setting this to 1000 works excellently. Process startup overhead decreases by three orders of magnitude, and only egregiously bad leaks add up to enough to matter. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Dec 5, 2010 at 2:45 PM, Rob Wultsch <wultsch@gmail.com> wrote: > I think you have read a bit more into what I have said than is > correct. MySQL can deal with thousands of users and separate schemas > on commodity hardware. There are many design decisions (some > questionable) that have made MySQL much better in a shared hosting > environment than pg and I don't know where the grants system falls > into that. Objection: Vague. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Dec 4, 2010 at 8:04 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > But who would be doing the passing? For the postmaster to be doing > that would probably go against the minimalist design. It would have > to keep track of which backend is available, and which db and user it > is primed for. Perhaps a feature could be added to the backend to > allow it to get passed an FD from pgbouncer or pgpool-II and then hand > control back to the pooler upon "close" of the connection, as they > already have the infrastructure to keep pools around while the > postmaster does not. Are pgbouncer and pgpool close enough to "core" > to make such intimate collaboration with the backend OK?

I am not sure. I'm afraid that might be adding complexity without really solving anything, but maybe I'm a pessimist.

One possible way to make an improvement in this area would be to move the responsibility for accepting connections out of the postmaster. Instead, you'd have a group of children that would all call accept() on the socket, and the OS would arbitrarily pick one to receive each new incoming connection. The postmaster would just be responsible for making sure that there were enough children hanging around. You could in fact make this change without doing anything else, in which case it wouldn't save any work but would possibly reduce connection latency a bit, since more of the work could be done before the connection actually arrived.

From there, you could go two ways. One option would be to have backends that would otherwise terminate normally instead do the equivalent of DISCARD ALL and then go back around and try to accept() another incoming connection. If they get a guy who wants the database to which they previously connected, profit. If not, laboriously flush every cache in sight and rebind to the new database.

Another option would be to have backends that would otherwise terminate normally instead do the equivalent of DISCARD ALL and then mark themselves as able to accept a new connection to the same database to which they are already connected (but not any other database). Following authentication, a backend that accepted a new incoming connection looks through the pool of such backends and, if it finds one, hands off the connection using file-descriptor passing and then loops back around to accept() again. Otherwise it handles the connection itself. This wouldn't offer much of an advantage over the first option for a cluster that basically has just one database, or for a cluster that has 1000 actively used databases. But it would be much better for a system with three databases.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
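In skeleton form, the shared-accept() part of this might look like the following (a sketch only; authentication, respawning of dead children, and the DISCARD ALL machinery are all elided):

#include <sys/socket.h>
#include <unistd.h>

#define NUM_SPARE_CHILDREN 4

/* Each pre-forked child blocks in accept() on the listening socket it
 * inherited; the kernel wakes exactly one waiting child per connection. */
static void child_main(int listen_fd)
{
    for (;;)
    {
        int conn = accept(listen_fd, NULL, NULL);

        if (conn < 0)
            continue;               /* EINTR and friends: just retry */
        /* ... authenticate and run the session, then clean up ... */
        close(conn);
    }
}

static void postmaster_main(int listen_fd)
{
    int i;

    for (i = 0; i < NUM_SPARE_CHILDREN; i++)
    {
        if (fork() == 0)
            child_main(listen_fd);  /* child never returns */
    }
    /* ... monitor children and fork replacements to keep the pool full ... */
}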
On Sun, Dec 5, 2010 at 6:59 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sun, Dec 5, 2010 at 2:45 PM, Rob Wultsch <wultsch@gmail.com> wrote: >> I think you have read a bit more into what I have said than is >> correct. MySQL can deal with thousands of users and separate schemas >> on commodity hardware. There are many design decisions (some >> questionable) that have made MySQL much better in a shared hosting >> environment than pg and I don't know where the grants system falls >> into that. > > Objection: Vague. > I retract the remark, your honor. At some point Hackers should look at pg vs MySQL multi-tenantry but it is way tangential today. -- Rob Wultsch wultsch@gmail.com
On Sun, Dec 5, 2010 at 9:35 PM, Rob Wultsch <wultsch@gmail.com> wrote: > On Sun, Dec 5, 2010 at 6:59 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Sun, Dec 5, 2010 at 2:45 PM, Rob Wultsch <wultsch@gmail.com> wrote: >>> I think you have read a bit more into what I have said than is >>> correct. MySQL can deal with thousands of users and separate schemas >>> on commodity hardware. There are many design decisions (some >>> questionable) that have made MySQL much better in a shared hosting >>> environment than pg and I don't know where the grants system falls >>> into that. >> >> Objection: Vague. > > I retract the remark, your honor. Clarifying it would be fine, too... :-) > At some point Hackers should look at pg vs MySQL multi tenantry but it > is way tangential today. My understanding is that our schemas work like MySQL databases; and our databases are an even higher level of isolation. No? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > One possible way to make an improvement in this area would be to > move the responsibility for accepting connections out of the > postmaster. Instead, you'd have a group of children that would all > call accept() on the socket, and the OS would arbitrarily pick one to > receive each new incoming connection. The postmaster would just be > responsible for making sure that there were enough children hanging > around. You could in fact make this change without doing anything > else, in which case it wouldn't save any work but would possibly > reduce connection latency a bit since more of the work could be done > before the connection actually arrived. This seems like potentially a good idea independent of anything else, just to reduce connection latency: fork() (not to mention exec() on Windows) now happens before, not after, receipt of the connection request. However, I see a couple of stumbling blocks:

1. Does accept() work that way everywhere? (Windows, I'm looking at you.)

2. What do you do when max_connections is exceeded, and you don't have anybody at all listening on the socket? Right now we are at least able to send back an error message explaining the problem.

Another issue that would require some thought is what algorithm the postmaster uses for deciding to spawn new children. But that doesn't sound like a potential showstopper. regards, tom lane
On 12/06/2010 09:38 AM, Tom Lane wrote: > Another issue that would require some thought is what algorithm the > postmaster uses for deciding to spawn new children. But that doesn't > sound like a potential showstopper. We'd probably want a couple of different ones, optimized for different connection patterns. Realistically. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Mon, Dec 6, 2010 at 12:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> One possible way to make an improvement in this area would be to >> move the responsibility for accepting connections out of the >> postmaster. Instead, you'd have a group of children that would all >> call accept() on the socket, and the OS would arbitrarily pick one to >> receive each new incoming connection. The postmaster would just be >> responsible for making sure that there were enough children hanging >> around. You could in fact make this change without doing anything >> else, in which case it wouldn't save any work but would possibly >> reduce connection latency a bit since more of the work could be done >> before the connection actually arrived. > > This seems like potentially a good idea independent of anything else, > just to reduce connection latency: fork() (not to mention exec() on > Windows) now happens before, not after, receipt of the connection request. > However, I see a couple of stumbling blocks: > > 1. Does accept() work that way everywhere? (Windows, I'm looking at you.)

Not sure. It might be useful to look at what Apache does, but I don't have time to do that ATM.

> 2. What do you do when max_connections is exceeded, and you don't have > anybody at all listening on the socket? Right now we are at least able > to send back an error message explaining the problem.

Sending back an error message explaining the problem seems like a non-negotiable requirement. I'm not quite sure how to dance around this. Perhaps if max_connections is exhausted, the postmaster itself joins the accept() queue and launches a dead-end backend for each new connection. Or perhaps we reserve one extra backend slot for a probably-dead-end backend that will just sit there and mail rejection notices; except that if it sees that a regular backend slot has opened up, it grabs it and turns itself into a regular backend.

> Another issue that would require some thought is what algorithm the > postmaster uses for deciding to spawn new children. But that doesn't > sound like a potential showstopper.

The obvious algorithm would be to try to keep N spare workers around. Any time the number of unconnected backends drops below N, the postmaster starts spawning new ones until it gets back up to N. I think the trick may not be the algorithm so much as finding a way to make the signaling sufficiently robust and lightweight. For example, I bet having each child that gets a new connection signal() the postmaster is a bad plan. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
>> At some point Hackers should look at pg vs MySQL multi tenantry but it >> is way tangential today. > > My understanding is that our schemas work like MySQL databases; and > our databases are an even higher level of isolation. No? That's correct. Drizzle is looking at implementing a feature like our databases called "catalogs" (per the SQL spec). Let me stress that not everyone is happy with the MySQL multi-tenantry approach. But it does make multi-tenancy on a scale which you seldom see with PG possible, even if it has problems. It's worth seeing whether we can steal any of their optimization ideas without breaking PG. I was specifically looking at the login model, which works around the issue that we have: namely that different login ROLEs can't share a connection pool. In MySQL, they can share the built-in connection "pool" because role-switching effectively is a session variable. AFAICT, anyway. For that matter, if anyone knows any other DB which does multi-tenant well/better, we should be looking at them too. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Mon, Dec 6, 2010 at 12:57 PM, Josh Berkus <josh@agliodbs.com> wrote: >>> At some point Hackers should look at pg vs MySQL multi tenantry but it >>> is way tangential today. >> >> My understanding is that our schemas work like MySQL databases; and >> our databases are an even higher level of isolation. No? > > That's correct. Drizzle is looking at implementing a feature like our > databases called "catalogs" (per the SQL spec). > > Let me stress that not everyone is happy with the MySQL multi-tenantry > approach. But it does make multi-tenancy on a scale which you seldom see > with PG possible, even if it has problems. It's worth seeing whether we can > steal any of their optimization ideas without breaking PG. Please make sure to articulate what you think is wrong with our existing model. > I was specifically looking at the login model, which works around the issue > that we have: namely that different login ROLEs can't share a connection > pool. In MySQL, they can share the built-in connection "pool" because > role-switching effectively is a session variable. AFAICT, anyway. Please explain more precisely what is wrong with SET SESSION AUTHORIZATION / SET ROLE. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> Please explain more precisely what is wrong with SET SESSION > AUTHORIZATION / SET ROLE.

1) Session GUCs do not change with a SET ROLE (this is a TODO I haven't had any time to work on)

2) Users can always issue their own SET ROLE and then "hack into" other users' data.

-- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Mon, Dec 6, 2010 at 2:47 PM, Josh Berkus <josh@agliodbs.com> wrote: > >> Please explain more precisely what is wrong with SET SESSION >> AUTHORIZATION / SET ROLE. > > 1) Session GUCs do not change with a SET ROLE (this is a TODO I haven't > had any time to work on) > > 2) Users can always issue their own SET ROLE and then "hack into" other > users' data. Makes sense. It would be nice to fix those issues, independent of anything else. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Excerpts from Robert Haas's message of Mon Dec 06 23:09:56 -0300 2010: > On Mon, Dec 6, 2010 at 2:47 PM, Josh Berkus <josh@agliodbs.com> wrote: > > > >> Please explain more precisely what is wrong with SET SESSION > >> AUTHORIZATION / SET ROLE. > > > > 1) Session GUCs do not change with a SET ROLE (this is a TODO I haven't > > had any time to work on) > > > > 2) Users can always issue their own SET ROLE and then "hack into" other > > users' data. > > Makes sense. It would be nice to fix those issues, independent of > anything else. It seems plausible to fix the first one, but how would you fix the second one? You either allow SET ROLE (which you need, to support the pooler changing authorization), or you don't. There doesn't seem to be a usable middle ground. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
> It seems plausible to fix the first one, but how would you fix the > second one? You either allow SET ROLE (which you need, to support the > pooler changing authorization), or you don't. There doesn't seem to be > a usable middle ground. Well, this is why such a pooler would *have* to be built into the backend. It would need to be able to SET ROLE even though SET ROLE would not be accepted over the client connection. We'd also need bookkeeping to track the ROLE (and other GUCs) of each client connection and reset them whenever that client connection switches back. Mind you, I'm not entirely convinced that the end result of this would be performant. And it would certainly be complicated. I think that we should start by dealing with the simplest situation, ignoring SET ROLE and GUC issues for now. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Mon, Dec 6, 2010 at 9:37 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Excerpts from Robert Haas's message of Mon Dec 06 23:09:56 -0300 2010: >> On Mon, Dec 6, 2010 at 2:47 PM, Josh Berkus <josh@agliodbs.com> wrote: >> > >> >> Please explain more precisely what is wrong with SET SESSION >> >> AUTHORIZATION / SET ROLE. >> > >> > 1) Session GUCs do not change with a SET ROLE (this is a TODO I haven't >> > had any time to work on) >> > >> > 2) Users can always issue their own SET ROLE and then "hack into" other >> > users' data. >> >> Makes sense. It would be nice to fix those issues, independent of >> anything else. > > It seems plausible to fix the first one, but how would you fix the > second one? You either allow SET ROLE (which you need, to support the > pooler changing authorization), or you don't. There doesn't seem to be > a usable middle ground. You could add a protocol message that does a "permanent" role switch in a way that can't be undone except by another such protocol message. Then connection poolers could simply refuse to proxy that particular message. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 07/12/10 10:48, Josh Berkus wrote: > >> It seems plausible to fix the first one, but how would you fix the >> second one? You either allow SET ROLE (which you need, to support the >> pooler changing authorization), or you don't. There doesn't seem to be >> a usable middle ground. > > Well, this is why such a pooler would *have* to be built into the > backend. It would need to be able to SET ROLE even though SET ROLE > would not be accepted over the client connection. There's actually another way to do that that could be retrofitted onto an existing external pooler. It's not lovely, but if the approach above proved too hard... SET ROLE could accept a cookie / one-time password that had to be passed to RESET ROLE in order for RESET ROLE to accept the command.

SET ROLE fred WITH COOKIE 'goqu8Mi6choht8ie';
-- hand to the user
-- blah blah user work blah
-- returned by the user
RESET ROLE WITH COOKIE 'goqu8Mi6choht8ie';

The tricky bit might be that the user should still be permitted to SET ROLE, but only to roles that the role the pooler switched them to ("fred") has rights to SET ROLE to, not to roles that the pooler user itself has rights to switch to. > We'd also need > bookkeeping to track the ROLE (and other GUCs) of each client connection > and reset them whenever that client connection switches back. I'm really interested in this direction. Taken just a little further, it could bring Pg to the point where query executors (backends) are separated from connection state, so a given backend could pick up and work on queries by several different connections in rapid succession. The advantage there is that idle connections would become cheap, low-overhead affairs. As I (poorly) understand how Pg is designed, it'd only be possible for a backend to work on queries that act on the same database; it couldn't really switch databases. That'd still be a real bonus, especially for newer users who don't realize they *need* a connection pool. -- System & Network Administrator POST Newspapers