Thread: pgsql: Generational memory allocator
Generational memory allocator

Add new style of memory allocator, known as Generational
appropriate for use in cases where memory is allocated
and then freed in roughly oldest first order (FIFO).

Use new allocator for logical decoding’s reorderbuffer
to significantly reduce memory usage and improve performance.

Author: Tomas Vondra
Reviewed-by: Simon Riggs

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/a4ccc1cef5a04cc054af83bc4582a045d5232cb3

Modified Files
--------------
src/backend/replication/logical/reorderbuffer.c |  80 +--
src/backend/utils/mmgr/Makefile                 |   2 +-
src/backend/utils/mmgr/README                   |  23 +
src/backend/utils/mmgr/generation.c             | 768 ++++++++++++++++++++++++
src/include/nodes/memnodes.h                    |   4 +-
src/include/nodes/nodes.h                       |   1 +
src/include/replication/reorderbuffer.h         |  15 +-
src/include/utils/memutils.h                    |   5 +
8 files changed, 819 insertions(+), 79 deletions(-)
Hi,

On 2017-11-22 18:48:19 +0000, Simon Riggs wrote:
> Generational memory allocator
>
> Add new style of memory allocator, known as Generational
> appropriate for use in cases where memory is allocated
> and then freed in roughly oldest first order (FIFO).
>
> Use new allocator for logical decoding’s reorderbuffer
> to significantly reduce memory usage and improve performance.

Looks like it's not quite valgrind clean:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2017-11-22%2022%3A30%3A01

Greetings,

Andres Freund
On 23 November 2017 at 11:16, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2017-11-22 18:48:19 +0000, Simon Riggs wrote:
>> Generational memory allocator
>>
>> Add new style of memory allocator, known as Generational
>> appropriate for use in cases where memory is allocated
>> and then freed in roughly oldest first order (FIFO).
>>
>> Use new allocator for logical decoding’s reorderbuffer
>> to significantly reduce memory usage and improve performance.
>
> Looks like it's not quite valgrind clean:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2017-11-22%2022%3A30%3A01

It doesn't report anything useful that would allow me to fix this.

Please enable some info reporting on the animal.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/23/2017 10:57 AM, Simon Riggs wrote:
> On 23 November 2017 at 11:16, Andres Freund <andres@anarazel.de> wrote:
>> Hi,
>>
>> On 2017-11-22 18:48:19 +0000, Simon Riggs wrote:
>>> Generational memory allocator
>>>
>>> Add new style of memory allocator, known as Generational
>>> appropriate for use in cases where memory is allocated
>>> and then freed in roughly oldest first order (FIFO).
>>>
>>> Use new allocator for logical decoding’s reorderbuffer
>>> to significantly reduce memory usage and improve performance.
>>
>> Looks like it's not quite valgrind clean:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2017-11-22%2022%3A30%3A01
>
> It doesn't report anything useful that would allow me to fix this.
>
> Please enable some info reporting on the animal.

I agree it would be useful to get the valgrind log from the animal, but
I guess you'd need to get valgrind working to fix the issue anyway.

Anyway, the attached patch fixes the issues for me - strictly speaking
the Assert is not needed, but it doesn't hurt.

regards
Tomas
On 24 November 2017 at 02:16, Tomas Vondra <tv@fuzzy.cz> wrote:
> On 11/23/2017 10:57 AM, Simon Riggs wrote:
>> On 23 November 2017 at 11:16, Andres Freund <andres@anarazel.de> wrote:
>>> Hi,
>>>
>>> On 2017-11-22 18:48:19 +0000, Simon Riggs wrote:
>>>> Generational memory allocator
>>>>
>>>> Add new style of memory allocator, known as Generational
>>>> appropriate for use in cases where memory is allocated
>>>> and then freed in roughly oldest first order (FIFO).
>>>>
>>>> Use new allocator for logical decoding’s reorderbuffer
>>>> to significantly reduce memory usage and improve performance.
>>>
>>> Looks like it's not quite valgrind clean:
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2017-11-22%2022%3A30%3A01
>>
>> It doesn't report anything useful that would allow me to fix this.
>>
>> Please enable some info reporting on the animal.
>
> I agree it would be useful to get the valgrind log from the animal, but I
> guess you'd need to get valgrind working to fix the issue anyway.

If something fails compilation it tells us why, not just "compilation
failed". The whole point of the buildfarm is to report errors to allow
them to be fixed, not just a fence that blocks things.

> Anyway, the attached patch fixes the issues for me - strictly speaking the
> Assert is not needed, but it doesn't hurt.

Thanks for the patch.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-11-23 20:57:10 +1100, Simon Riggs wrote:
> On 23 November 2017 at 11:16, Andres Freund <andres@anarazel.de> wrote:
> > Hi,
> >
> > On 2017-11-22 18:48:19 +0000, Simon Riggs wrote:
> >> Generational memory allocator
> >>
> >> Add new style of memory allocator, known as Generational
> >> appropriate for use in cases where memory is allocated
> >> and then freed in roughly oldest first order (FIFO).
> >>
> >> Use new allocator for logical decoding’s reorderbuffer
> >> to significantly reduce memory usage and improve performance.
> >
> > Looks like it's not quite valgrind clean:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2017-11-22%2022%3A30%3A01
>
> It doesn't report anything useful that would allow me to fix this.

Uh. It reports a failure, and you can go from there. The error output
actually is in postmaster.log but for some reason the buildfarm code
didn't display that in this case.

> Please enable some info reporting on the animal.

Hey, I just pointed out for you that there's a problem with something
that you committed. ISTM you're inverting the responsibilities more than
a bit here.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2017-11-23 20:57:10 +1100, Simon Riggs wrote:
>> On 23 November 2017 at 11:16, Andres Freund <andres@anarazel.de> wrote:
>>> Looks like it's not quite valgrind clean:
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2017-11-22%2022%3A30%3A01

>> It doesn't report anything useful that would allow me to fix this.

> Uh. It reports a failure, and you can go from there. The error output
> actually is in postmaster.log but for some reason the buildfarm code
> didn't display that in this case.

I think it's a legitimate complaint that postmaster.log wasn't captured
in this failure, but that's a buildfarm script oversight and hardly
Andres' fault. In any case, valgrind failures are generally easy enough
to reproduce locally.

Meanwhile, over on
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper&dt=2017-11-23%2013%3A56%3A17

we have

ccache gcc-4.7 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -I../../../../src/include -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include/mit-krb5 -c -o generation.o generation.c
generation.c: In function ‘GenerationContextCreate’:
generation.c:206:7: error: static assertion failed: "padding calculation in GenerationChunk is wrong"
make[4]: *** [generation.o] Error 1

Looks to me like GenerationChunk imagines that 3*sizeof(pointer)
must be a maxaligned quantity, which is wrong on platforms where
MAXALIGN is bigger than sizeof(pointer).

			regards, tom lane
On 11/23/2017 09:06 PM, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2017-11-23 20:57:10 +1100, Simon Riggs wrote:
>>> On 23 November 2017 at 11:16, Andres Freund <andres@anarazel.de> wrote:
>>>> Looks like it's not quite valgrind clean:
>>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2017-11-22%2022%3A30%3A01
>
>>> It doesn't report anything useful that would allow me to fix this.
>
>> Uh. It reports a failure, and you can go from there. The error output
>> actually is in postmaster.log but for some reason the buildfarm code
>> didn't display that in this case.

It may not be immediately obvious that the failure is due to valgrind,
but otherwise I agree it's up to us to investigate.

> I think it's a legitimate complaint that postmaster.log wasn't captured
> in this failure, but that's a buildfarm script oversight and hardly
> Andres' fault.

Are the valgrind errors really written to postmaster log? I'm assuming
it failed because valgrind ran into an issue and killed the process.

> In any case, valgrind failures are generally easy enough to reproduce
> locally.

Right.

> Meanwhile, over on
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper&dt=2017-11-23%2013%3A56%3A17
>
> we have
>
> generation.c: In function ‘GenerationContextCreate’:
> generation.c:206:7: error: static assertion failed: "padding calculation in GenerationChunk is wrong"
> make[4]: *** [generation.o] Error 1
>
> Looks to me like GenerationChunk imagines that 3*sizeof(pointer)
> must be a maxaligned quantity, which is wrong on platforms where
> MAXALIGN is bigger than sizeof(pointer).

Hmmm, I see. Presumably adding this to GenerationChunk (similarly to
what we do in AllocChunkData) would address the issue:

#if MAXIMUM_ALIGNOF > 4 && SIZEOF_VOID_P == 4
    Size        padding;
#endif

but I have no way to verify that (no access to such machine). I wonder
why SlabChunk doesn't need to do that (perhaps a comment explaining that
would be appropriate?).

regards
Tomas
On 24 November 2017 at 07:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Meanwhile, over on
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper&dt=2017-11-23%2013%3A56%3A17
>
> we have
>
> generation.c: In function ‘GenerationContextCreate’:
> generation.c:206:7: error: static assertion failed: "padding calculation in GenerationChunk is wrong"
> make[4]: *** [generation.o] Error 1
>
> Looks to me like GenerationChunk imagines that 3*sizeof(pointer)
> must be a maxaligned quantity, which is wrong on platforms where
> MAXALIGN is bigger than sizeof(pointer).

Thanks, yes.

I can't see how to resolve that without breaking the design assumption
at line 122, "there must not be any padding to reach a MAXALIGN boundary
here!". So I'll wait for Tomas to comment.

(I'm just about to catch a long flight, so will be offline for 24 hours,
so we may need to revert this before I get back if it is difficult to
resolve.)

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-11-23 22:34:57 +0100, Tomas Vondra wrote:
> > I think it's a legitimate complaint that postmaster.log wasn't captured
> > in this failure, but that's a buildfarm script oversight and hardly
> > Andres' fault.
>
> Are the valgrind errors really written to postmaster log? I'm assuming it
> failed because valgrind ran into an issue and killed the process.

Yes. Search e.g. in
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2017-09-18%2004%3A10%3A01
for VALGRINDERROR-BEGIN. (You could argue that they're only written
there in certain configurations, because I'd assume it'd not work in
e.g. syslog configured systems, because valgrind just prints to stderr).

> Hmmm, I see. Presumably adding this to GenerationChunk (similarly to what we
> do in AllocChunkData) would address the issue:
>
> #if MAXIMUM_ALIGNOF > 4 && SIZEOF_VOID_P == 4
>     Size        padding;
> #endif
>
> but I have no way to verify that (no access to such machine). I wonder why
> SlabChunk doesn't need to do that (perhaps a comment explaining that would
> be appropriate?).

Can't you just compile pg on a 32bit system and manually define MAXALIGN
to 8 bytes?

Greetings,

Andres Freund
On 24 November 2017 at 06:39, Andres Freund <andres@anarazel.de> wrote:
> On 2017-11-23 20:57:10 +1100, Simon Riggs wrote:
>> On 23 November 2017 at 11:16, Andres Freund <andres@anarazel.de> wrote:
>> > Hi,
>> >
>> > On 2017-11-22 18:48:19 +0000, Simon Riggs wrote:
>> >> Generational memory allocator
>> >>
>> >> Add new style of memory allocator, known as Generational
>> >> appropriate for use in cases where memory is allocated
>> >> and then freed in roughly oldest first order (FIFO).
>> >>
>> >> Use new allocator for logical decoding’s reorderbuffer
>> >> to significantly reduce memory usage and improve performance.
>> >
>> > Looks like it's not quite valgrind clean:
>> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2017-11-22%2022%3A30%3A01
>>
>> It doesn't report anything useful that would allow me to fix this.
>
> Uh. It reports a failure, and you can go from there. The error output
> actually is in postmaster.log but for some reason the buildfarm code
> didn't display that in this case.
>
>> Please enable some info reporting on the animal.
>
> Hey, I just pointed out for you that there's a problem with something
> that you committed. ISTM you're inverting the responsibilities more than
> a bit here.

I don't believe I was inverting responsibilities, but sorry if you
thought I was.

Yes, there is a bug in something I committed which I know about because
of the buildfarm. I am happy to work on fixing it. No discussion on
responsibility there.

I can see that buildfarm animal is yours, so I asked you for more info
about the bug on that animal. ISTM reasonable to ask for some more
detail on a bug report, and where that is an automated service to
request that the owner of that service make changes to assist. I don't
see why that implies some change of my responsibilities.

Thanks for your help

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andres Freund <andres@anarazel.de> writes:
> On 2017-11-23 22:34:57 +0100, Tomas Vondra wrote:
>> Hmmm, I see. Presumably adding this to GenerationChunk (similarly to what we
>> do in AllocChunkData) would address the issue:
>>
>> #if MAXIMUM_ALIGNOF > 4 && SIZEOF_VOID_P == 4
>>     Size        padding;
>> #endif
>>
>> but I have no way to verify that (no access to such machine). I wonder why
>> SlabChunk doesn't need to do that (perhaps a comment explaining that would
>> be appropriate?).

> Can't you just compile pg on a 32bit system and manually define MAXALIGN
> to 8 bytes?

I pushed a patch that computes how much padding to add and adds it.
(It might not really work if size_t and void * are different sizes,
because then there could be additional padding in the struct; but
that seems very unlikely.)

			regards, tom lane
On 11/23/2017 11:04 PM, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2017-11-23 22:34:57 +0100, Tomas Vondra wrote:
>>> Hmmm, I see. Presumably adding this to GenerationChunk (similarly to what we
>>> do in AllocChunkData) would address the issue:
>>>
>>> #if MAXIMUM_ALIGNOF > 4 && SIZEOF_VOID_P == 4
>>>     Size        padding;
>>> #endif
>>>
>>> but I have no way to verify that (no access to such machine). I wonder why
>>> SlabChunk doesn't need to do that (perhaps a comment explaining that would
>>> be appropriate?).
>
>> Can't you just compile pg on a 32bit system and manually define MAXALIGN
>> to 8 bytes?
>
> I pushed a patch that computes how much padding to add and adds it.
> (It might not really work if size_t and void * are different sizes,
> because then there could be additional padding in the struct; but
> that seems very unlikely.)

Thanks. Do we need to do something similar to the other memory contexts?
I see Slab does not do this at all (assuming it's not necessary), and
AllocSet does this in a different way (which seems a bit strange).

regards
Tomas
Tomas Vondra <tv@fuzzy.cz> writes:
> On 11/23/2017 11:04 PM, Tom Lane wrote:
>> I pushed a patch that computes how much padding to add and adds it.
>> (It might not really work if size_t and void * are different sizes,
>> because then there could be additional padding in the struct; but
>> that seems very unlikely.)

> Thanks. Do we need to do something similar to the other memory contexts?
> I see Slab does not do this at all (assuming it's not necessary), and
> AllocSet does this in a different way (which seems a bit strange).

Hm ... the coding in AllocSet is ancient and I have to say that I don't
like it as well as what I put into generation.c. We could not have done
it in that way at the time, because (IIRC) we did not have constant
macros for SIZEOF_SIZE_T, and maybe not SIZEOF_VOID_P either --- and you
need constant macros in order to do the #if calculation, because
sizeof() does not work in preprocessor expressions. But we have 'em now,
so I'm tempted to change aset.c to match generation.c.

As for SlabChunk, I think that's fine as-is (except I wonder why the
"block" field is declared "void *" rather than "SlabBlock *"). Since the
contents are fixed at two pointer fields, it's impossible that there's
any unexpected padding in there --- the struct size is basically
guaranteed to be 2 * sizeof(pointer). Which makes it either 8 or 16
bytes, either one of which is certain to be a multiple of MAXALIGN. So I
think the StaticAssert that that's true is plenty sufficient in that
case.

			regards, tom lane
On 24 November 2017 at 09:04, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2017-11-23 22:34:57 +0100, Tomas Vondra wrote:
>>> Hmmm, I see. Presumably adding this to GenerationChunk (similarly to what we
>>> do in AllocChunkData) would address the issue:
>>>
>>> #if MAXIMUM_ALIGNOF > 4 && SIZEOF_VOID_P == 4
>>>     Size        padding;
>>> #endif
>>>
>>> but I have no way to verify that (no access to such machine). I wonder why
>>> SlabChunk doesn't need to do that (perhaps a comment explaining that would
>>> be appropriate?).
>
>> Can't you just compile pg on a 32bit system and manually define MAXALIGN
>> to 8 bytes?
>
> I pushed a patch that computes how much padding to add and adds it.
> (It might not really work if size_t and void * are different sizes,
> because then there could be additional padding in the struct; but
> that seems very unlikely.)

Oh, OK, thanks. It sunk in what was needed while flying, but that's
better than my efforts.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/23/2017 03:06 PM, Tom Lane wrote:
>
> I think it's a legitimate complaint that postmaster.log wasn't captured
> in this failure, but that's a buildfarm script oversight and hardly
> Andres' fault.

A fix for this will be in the next buildfarm release.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tv@fuzzy.cz> writes:
> I agree it would be useful to get the valgrind log from the animal, but
> I guess you'd need to get valgrind working to fix the issue anyway.

> Anyway, the attached patch fixes the issues for me - strictly speaking
> the Assert is not needed, but it doesn't hurt.

For me, this patch fixes the valgrind failures inside generation.c
itself, but I still see one more in the test_decoding run:

==00:00:00:08.768 15475== Invalid read of size 8
==00:00:00:08.768 15475==    at 0x710BD0: SnapBuildProcessNewCid (snapbuild.c:780)
==00:00:00:08.768 15475==    by 0x70291E: LogicalDecodingProcessRecord (decode.c:371)
==00:00:00:08.768 15475==    by 0x705A04: pg_logical_slot_get_changes_guts (logicalfuncs.c:308)
==00:00:00:08.768 15475==    by 0x6279F5: ExecMakeTableFunctionResult (execSRF.c:231)
==00:00:00:08.768 15475==    by 0x634C41: FunctionNext (nodeFunctionscan.c:95)
==00:00:00:08.768 15475==    by 0x626976: ExecScan (execScan.c:97)
==00:00:00:08.768 15475==    by 0x622AF6: standard_ExecutorRun (executor.h:241)
==00:00:00:08.768 15475==    by 0x76B07A: PortalRunSelect (pquery.c:932)
==00:00:00:08.768 15475==    by 0x76C3E0: PortalRun (pquery.c:773)
==00:00:00:08.768 15475==    by 0x7687EC: exec_simple_query (postgres.c:1120)
==00:00:00:08.768 15475==    by 0x76A5C7: PostgresMain (postgres.c:4139)
==00:00:00:08.768 15475==    by 0x6EFDC9: PostmasterMain (postmaster.c:4412)
==00:00:00:08.768 15475==  Address 0xe7e846c is 28 bytes inside a block of size 34 client-defined
==00:00:00:08.768 15475==    at 0x89A17F: palloc (mcxt.c:871)
==00:00:00:08.768 15475==    by 0x51FF07: DecodeXLogRecord (xlogreader.c:1283)
==00:00:00:08.768 15475==    by 0x520CD3: XLogReadRecord (xlogreader.c:475)
==00:00:00:08.768 15475==    by 0x7059E4: pg_logical_slot_get_changes_guts (logicalfuncs.c:293)
==00:00:00:08.769 15475==    by 0x6279F5: ExecMakeTableFunctionResult (execSRF.c:231)
==00:00:00:08.769 15475==    by 0x634C41: FunctionNext (nodeFunctionscan.c:95)
==00:00:00:08.769 15475==    by 0x626976: ExecScan (execScan.c:97)
==00:00:00:08.769 15475==    by 0x622AF6: standard_ExecutorRun (executor.h:241)
==00:00:00:08.769 15475==    by 0x76B07A: PortalRunSelect (pquery.c:932)
==00:00:00:08.769 15475==    by 0x76C3E0: PortalRun (pquery.c:773)
==00:00:00:08.769 15475==    by 0x7687EC: exec_simple_query (postgres.c:1120)
==00:00:00:08.769 15475==    by 0x76A5C7: PostgresMain (postgres.c:4139)
==00:00:00:08.769 15475==

Not sure what to make of this: the stack traces make it look unrelated
to the GenerationContext changes, but if it's not related, how come
skink was passing before that patch went in? I'm tempted to push Tomas'
patch as-is, just to see if skink sees the same failure I do.

BTW, I'm pretty certain that there are additional bugs in generation.c's
valgrind support, in code paths that the regression tests likely do not
reach. I do not think it's setting up valgrind correctly for "large"
chunks, and it looks to me like GenerationCheck would fail because it
tries to access chunk->requested_size which is generally kept NOACCESS.
I'm tempted to rip out all of the logic concerned with flipping
chunk->requested_size between normal and NOACCESS states, as that seems
to me like an exercise much more likely to introduce bugs than to catch
any.

			regards, tom lane
I wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>> Thanks. Do we need to do something similar to the other memory contexts?
>> I see Slab does not do this at all (assuming it's not necessary), and
>> AllocSet does this in a different way (which seems a bit strange).

> Hm ... the coding in AllocSet is ancient and I have to say that I don't
> like it as well as what I put into generation.c.

I take that back: the current coding of padding in AllocChunkData only
dates back to 7e3aa03b. But I still don't like it, and will migrate
generation.c's version into aset.c. And maybe improve the comments.

BTW, it appears that some of the confusion in generation.c can be
accounted for by not having entirely followed the changes in 7e3aa03b.

			regards, tom lane
I wrote:
> For me, this patch fixes the valgrind failures inside generation.c
> itself, but I still see one more in the test_decoding run: ...
> Not sure what to make of this: the stack traces make it look unrelated
> to the GenerationContext changes, but if it's not related, how come
> skink was passing before that patch went in?

I've pushed fixes for everything that I could find wrong in generation.c
(and there was a lot :-(). But I'm still seeing the "invalid read in
SnapBuildProcessNewCid" failure when I run test_decoding under valgrind.
Somebody who has more familiarity with the logical decoding stuff than
I do needs to look into that.

I tried to narrow down exactly which fetch in SnapBuildProcessNewCid was
triggering the failure, with the attached patch. Weirdly, *it does not
fail* with this. I have no explanation for that.

			regards, tom lane

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ad65b98..a48db7e 100644
*** a/src/backend/replication/logical/snapbuild.c
--- b/src/backend/replication/logical/snapbuild.c
*************** SnapBuildProcessNewCid(SnapBuild *builde
*** 770,775 ****
--- 770,781 ----
  				   XLogRecPtr lsn, xl_heap_new_cid *xlrec)
  {
  	CommandId	cid;
+ 	TransactionId top_xid;
+ 	RelFileNode node;
+ 	ItemPointerData tid;
+ 	CommandId	cmin;
+ 	CommandId	cmax;
+ 	CommandId	combocid;
  
  	/*
  	 * we only log new_cid's if a catalog tuple was modified, so mark the
*************** SnapBuildProcessNewCid(SnapBuild *builde
*** 777,786 ****
  	 */
  	ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
  
! 	ReorderBufferAddNewTupleCids(builder->reorder, xlrec->top_xid, lsn,
! 								 xlrec->target_node, xlrec->target_tid,
! 								 xlrec->cmin, xlrec->cmax,
! 								 xlrec->combocid);
  
  	/* figure out new command id */
  	if (xlrec->cmin != InvalidCommandId &&
--- 783,799 ----
  	 */
  	ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
  
! 	top_xid = xlrec->top_xid;
! 	node = xlrec->target_node;
! 	tid = xlrec->target_tid;
! 	cmin = xlrec->cmin;
! 	cmax = xlrec->cmax;
! 	combocid = xlrec->combocid;
! 
! 	ReorderBufferAddNewTupleCids(builder->reorder, top_xid, lsn,
! 								 node, tid,
! 								 cmin, cmax,
! 								 combocid);
  
  	/* figure out new command id */
  	if (xlrec->cmin != InvalidCommandId &&
Hi,

On 11/25/2017 02:25 AM, Tom Lane wrote:
> I wrote:
>> For me, this patch fixes the valgrind failures inside generation.c
>> itself, but I still see one more in the test_decoding run: ...
>> Not sure what to make of this: the stack traces make it look unrelated
>> to the GenerationContext changes, but if it's not related, how come
>> skink was passing before that patch went in?
>
> I've pushed fixes for everything that I could find wrong in generation.c
> (and there was a lot :-(). But I'm still seeing the "invalid read in
> SnapBuildProcessNewCid" failure when I run test_decoding under valgrind.
> Somebody who has more familiarity with the logical decoding stuff than
> I do needs to look into that.
>
> I tried to narrow down exactly which fetch in SnapBuildProcessNewCid was
> triggering the failure, with the attached patch. Weirdly, *it does not
> fail* with this. I have no explanation for that.

I have no explanation for that either.

FWIW I don't think this is related to the new memory contexts. I can
reproduce it on 3bae43c (i.e. before the Generation memory context was
introduced), and with Slab removed from ReorderBuffer.

I wonder if this might be a valgrind issue. I'm not sure which version
skink is using, but I'm running with valgrind-3.12.0-9.el7_4.x86_64.

BTW I also see these failures in hstore:

==15168== Source and destination overlap in memcpy(0x5d0fed0, 0x5d0fed0, 40)
==15168==    at 0x4C2E00C: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1018)
==15168==    by 0x15419A06: hstoreUniquePairs (hstore_io.c:343)
==15168==    by 0x15419EE4: hstore_in (hstore_io.c:416)
==15168==    by 0x9ED11A: InputFunctionCall (fmgr.c:1635)
==15168==    by 0x9ED3C2: OidInputFunctionCall (fmgr.c:1738)
==15168==    by 0x6014A2: stringTypeDatum (parse_type.c:641)
==15168==    by 0x5E1ADC: coerce_type (parse_coerce.c:304)
==15168==    by 0x5E17A9: coerce_to_target_type (parse_coerce.c:103)
==15168==    by 0x5EDD6D: transformTypeCast (parse_expr.c:2724)
==15168==    by 0x5E8860: transformExprRecurse (parse_expr.c:203)
==15168==    by 0x5E8601: transformExpr (parse_expr.c:156)
==15168==    by 0x5FCF95: transformTargetEntry (parse_target.c:103)
==15168==    by 0x5FD15D: transformTargetList (parse_target.c:191)
==15168==    by 0x5A5EEC: transformSelectStmt (analyze.c:1214)
==15168==    by 0x5A4453: transformStmt (analyze.c:297)
==15168==    by 0x5A4381: transformOptionalSelectInto (analyze.c:242)
==15168==    by 0x5A423F: transformTopLevelStmt (analyze.c:192)
==15168==    by 0x5A4097: parse_analyze (analyze.c:112)
==15168==    by 0x87E0AF: pg_analyze_and_rewrite (postgres.c:664)
==15168==    by 0x87E6EE: exec_simple_query (postgres.c:1045)

Seems hstoreUniquePairs may call memcpy with the same pointers in some
cases (which looks a bit dubious). But the code is ancient, so it's
strange it didn't fail before.

regards
Tomas
Tomas Vondra <tv@fuzzy.cz> writes:
> BTW I also see these failures in hstore:

> ==15168== Source and destination overlap in memcpy(0x5d0fed0, 0x5d0fed0, 40)
> ==15168==    at 0x4C2E00C: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1018)
> ==15168==    by 0x15419A06: hstoreUniquePairs (hstore_io.c:343)
> ==15168==    by 0x15419EE4: hstore_in (hstore_io.c:416)

Huh ...

> Seems hstoreUniquePairs may call memcpy with the same pointers in some
> cases (which looks a bit dubious). But the code is ancient, so it's
> strange it didn't fail before.

Quite. It's easy to see how to avoid the nominally-undefined behavior:

-				memcpy(res, ptr, sizeof(Pairs));
+				if (res != ptr)
+					memcpy(res, ptr, sizeof(Pairs));

but this should surely have been noticed by valgrind tests before.
The case doesn't necessarily occur --- if the first two items in sorted
order are dups, it won't --- but if you're seeing it occur in regression
testing then skink should have seen it as well.

			regards, tom lane
I wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>> BTW I also see these failures in hstore:

>> ==15168== Source and destination overlap in memcpy(0x5d0fed0, 0x5d0fed0, 40)
>> ==15168==    at 0x4C2E00C: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1018)
>> ==15168==    by 0x15419A06: hstoreUniquePairs (hstore_io.c:343)
>> ==15168==    by 0x15419EE4: hstore_in (hstore_io.c:416)

> Huh ...

I tried to duplicate this on my RHEL6 workstation, and failed to, even
though adding an assertion easily proves that the hstore regression test
does exercise the case. So apparently the answer as to why skink isn't
reporting this is just "not all versions of valgrind check it".

I checked the git history and noted that we've previously fixed other
nominally-undefined usage of memcpy() on the grounds that OpenBSD's libc
will complain about it. So it seems like something that would be good to
fix, and I did so.

Meanwhile, skink has come back with a success as of 0f2458f, rendering
the other mystery even deeper --- although at least this confirms the
idea that the crashes we are seeing predate the generation.c patch.

			regards, tom lane
Hi,

On 2017-11-25 14:50:41 -0500, Tom Lane wrote:
> I wrote:
> > Tomas Vondra <tv@fuzzy.cz> writes:
> >> BTW I also see these failures in hstore:
> >
> >> ==15168== Source and destination overlap in memcpy(0x5d0fed0, 0x5d0fed0, 40)
> >> ==15168==    at 0x4C2E00C: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1018)
> >> ==15168==    by 0x15419A06: hstoreUniquePairs (hstore_io.c:343)
> >> ==15168==    by 0x15419EE4: hstore_in (hstore_io.c:416)
> >
> > Huh ...
>
> I tried to duplicate this on my RHEL6 workstation, and failed to,
> even though adding an assertion easily proves that the hstore
> regression test does exercise the case. So apparently the answer
> as to why skink isn't reporting this is just "not all versions of
> valgrind check it".

I suspect that the issue rather is that the compiler will sometimes
replace the memcpy() with an in-line member-by-member version. That'll
not be visible as a memcpy to valgrind.

> Meanwhile, skink has come back with a success as of 0f2458f, rendering
> the other mystery even deeper --- although at least this confirms the
> idea that the crashes we are seeing predate the generation.c patch.

Hm, wonder whether that could be optimization dependent too.

Greetings,

Andres Freund
On 2017-11-25 12:00:15 -0800, Andres Freund wrote:
> Hi,
>
> On 2017-11-25 14:50:41 -0500, Tom Lane wrote:
> > I wrote:
> > > Tomas Vondra <tv@fuzzy.cz> writes:
> > >> BTW I also see these failures in hstore:
> >
> > >> ==15168== Source and destination overlap in memcpy(0x5d0fed0, 0x5d0fed0, 40)
> > >> ==15168==    at 0x4C2E00C: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1018)
> > >> ==15168==    by 0x15419A06: hstoreUniquePairs (hstore_io.c:343)
> > >> ==15168==    by 0x15419EE4: hstore_in (hstore_io.c:416)
> >
> > > Huh ...
> >
> > I tried to duplicate this on my RHEL6 workstation, and failed to,
> > even though adding an assertion easily proves that the hstore
> > regression test does exercise the case.  So apparently the answer
> > as to why skink isn't reporting this is just "not all versions of
> > valgrind check it".
>
> I suspect that the issue rather is that the compiler will sometimes
> replace the memcpy() with an in-line member-by-member version.  That'll
> not be visible as a memcpy to valgrind.

That's indeed the case.  Here's the disassembly from skink, albeit for
v10, because those objects were currently present:

disassemble /s hstoreUniquePairs
...
342				res++;
   0x00000000000005c2 <+174>:	add    $0x28,%rbx

343				memcpy(res, ptr, sizeof(Pairs));
   0x00000000000005c6 <+178>:	mov    (%r12),%rax
   0x00000000000005ca <+182>:	mov    0x8(%r12),%rdx
   0x00000000000005cf <+187>:	mov    %rax,(%rbx)
   0x00000000000005d2 <+190>:	mov    %rdx,0x8(%rbx)
   0x00000000000005d6 <+194>:	mov    0x10(%r12),%rax
   0x00000000000005db <+199>:	mov    0x18(%r12),%rdx
   0x00000000000005e0 <+204>:	mov    %rax,0x10(%rbx)
   0x00000000000005e4 <+208>:	mov    %rdx,0x18(%rbx)
   0x00000000000005e8 <+212>:	mov    0x20(%r12),%rax
   0x00000000000005ed <+217>:	mov    %rax,0x20(%rbx)

344			}
345
...

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2017-11-25 14:50:41 -0500, Tom Lane wrote:
>> Meanwhile, skink has come back with a success as of 0f2458f, rendering
>> the other mystery even deeper --- although at least this confirms the
>> idea that the crashes we are seeing predate the generation.c patch.

> Hm, wonder whether that could be optimization dependent too.

Evidently :-(.  I examined my compiler's assembly output for the
relevant line, snapbuild.c:780:

		ReorderBufferAddNewTupleCids(builder->reorder, xlrec->top_xid, lsn,
									 xlrec->target_node, xlrec->target_tid,
									 xlrec->cmin, xlrec->cmax,
									 xlrec->combocid);

which is

	movl	12(%rbx), %eax		combocid
	movq	16(%rbx), %rcx		target_node.spcNode + dbNode
	movq	%rbp, %rdx
	movl	24(%rbx), %r8d		target_node.relNode
	movq	56(%r12), %rdi
	movq	28(%rbx), %r9		target_tid ... 8 bytes worth
	movl	(%rbx), %esi		top_xid
	movl	%eax, 16(%rsp)
	movl	8(%rbx), %eax		cmax
	movl	%eax, 8(%rsp)
	movl	4(%rbx), %eax		cmin
	movl	%eax, (%rsp)
	call	ReorderBufferAddNewTupleCids

I've annotated the fetches from *rbx to match the data structure:

typedef struct xl_heap_new_cid
{
	/*
	 * store toplevel xid so we don't have to merge cids from different
	 * transactions
	 */
	TransactionId top_xid;
	CommandId	cmin;
	CommandId	cmax;

	/*
	 * don't really need the combocid since we have the actual values right in
	 * this struct, but the padding makes it free and it's useful for
	 * debugging.
	 */
	CommandId	combocid;

	/*
	 * Store the relfilenode/ctid pair to facilitate lookups.
	 */
	RelFileNode target_node;
	ItemPointerData target_tid;
} xl_heap_new_cid;

and what's apparent is that it's grabbing 8 bytes not just 6 from
target_tid.  So the failure case must occur when the xl_heap_new_cid
WAL record is right at the end of the palloc'd record buffer and
valgrind notices that we're fetching padding bytes.  I suspect that gcc
feels within its rights to do this because the size/alignment of
xl_heap_new_cid is a multiple of 4 bytes, so that there ought to be
2 padding bytes at the end.
We could possibly prevent that by marking the whole xl_heap_new_cid
struct as packed/align(2), the same way ItemPointerData is marked, but
that would render accesses to all of its fields inefficient on
alignment-sensitive machines.  Moreover, if we go that route, we have a
boatload of other xlog record formats with similar hazards.

Instead I propose that we should make sure that the palloc request size
for XLogReaderState->main_data is always maxalign'd.  The existing
behavior in DecodeXLogRecord of palloc'ing it only just barely big
enough for the current record seems pretty brain-dead performance-wise
even without this consideration.  Generally, if we need to enlarge
that buffer, we should enlarge it significantly, IMO.

BTW, that comment in the middle of xl_heap_new_cid about "padding makes
this field free" seems entirely wrong.  By my count the struct is
currently 34 bytes, so if by padding it's talking about maxalign
alignment of the whole struct, that field is costing us 8 bytes on
maxalign-8 machines.  There's certainly no internal alignment that
would make it free.

			regards, tom lane
I wrote:
> Instead I propose that we should make sure that the palloc request size
> for XLogReaderState->main_data is always maxalign'd.  The existing
> behavior in DecodeXLogRecord of palloc'ing it only just barely big
> enough for the current record seems pretty brain-dead performance-wise
> even without this consideration.  Generally, if we need to enlarge
> that buffer, we should enlarge it significantly, IMO.

I've confirmed that the attached is sufficient to stop the valgrind crash
on my machine.  But as I said, I think we should be more aggressive at
resizing the buffer, to reduce resize cycles.  I'm inclined to start out
with a buffer size of 128 or 256 or so bytes and double it when needed.
Anybody have a feeling for a typical size for the "main data" part
of a WAL record?

			regards, tom lane

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index aeaafed..4840b0e 100644
*** a/src/backend/access/transam/xlogreader.c
--- b/src/backend/access/transam/xlogreader.c
*************** DecodeXLogRecord(XLogReaderState *state,
*** 1279,1285 ****
  	{
  		if (state->main_data)
  			pfree(state->main_data);
! 		state->main_data_bufsz = state->main_data_len;
  		state->main_data = palloc(state->main_data_bufsz);
  	}
  	memcpy(state->main_data, ptr, state->main_data_len);
--- 1279,1285 ----
  	{
  		if (state->main_data)
  			pfree(state->main_data);
! 		state->main_data_bufsz = MAXALIGN(state->main_data_len);
  		state->main_data = palloc(state->main_data_bufsz);
  	}
  	memcpy(state->main_data, ptr, state->main_data_len);
On 26 November 2017 at 08:46, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Instead I propose that we should make sure that the palloc request size
>> for XLogReaderState->main_data is always maxalign'd.  The existing
>> behavior in DecodeXLogRecord of palloc'ing it only just barely big
>> enough for the current record seems pretty brain-dead performance-wise
>> even without this consideration.  Generally, if we need to enlarge
>> that buffer, we should enlarge it significantly, IMO.
>
> I've confirmed that the attached is sufficient to stop the valgrind crash
> on my machine.  But as I said, I think we should be more aggressive at
> resizing the buffer, to reduce resize cycles.  I'm inclined to start out
> with a buffer size of 128 or 256 or so bytes and double it when needed.
> Anybody have a feeling for a typical size for the "main data" part
> of a WAL record?

We reuse the buffer and only pfree/palloc when we need to enlarge the
buffer, so not sure we need to do the doubling thing and it probably
doesn't matter what the typical size is.

So I think we're just good to go with your patch.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 25 November 2017 at 02:35, Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:
>
> On 11/23/2017 03:06 PM, Tom Lane wrote:
>>
>> I think it's a legitimate complaint that postmaster.log wasn't captured
>> in this failure, but that's a buildfarm script oversight and hardly
>> Andres' fault.
>
> A fix for this will be in the next buildfarm release.

Thanks

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 25 November 2017 at 12:25, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> For me, this patch fixes the valgrind failures inside generation.c
>> itself, but I still see one more in the test_decoding run: ...
>> Not sure what to make of this: the stack traces make it look unrelated
>> to the GenerationContext changes, but if it's not related, how come
>> skink was passing before that patch went in?
>
> I've pushed fixes for everything that I could find wrong in generation.c
> (and there was a lot :-().

Thanks very much for doing that. I'll review changes and comment.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Simon Riggs <simon@2ndquadrant.com> writes:
> On 26 November 2017 at 08:46, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I've confirmed that the attached is sufficient to stop the valgrind crash
>> on my machine.  But as I said, I think we should be more aggressive at
>> resizing the buffer, to reduce resize cycles.  I'm inclined to start out
>> with a buffer size of 128 or 256 or so bytes and double it when needed.
>> Anybody have a feeling for a typical size for the "main data" part
>> of a WAL record?

> We reuse the buffer and only pfree/palloc when we need to enlarge the
> buffer, so not sure we need to do the doubling thing and it probably
> doesn't matter what the typical size is.

Well, I'm concerned about the possibility of a lot of palloc thrashing
if the first bunch of records it reads happen to have steadily increasing
sizes.  However, rather than doubling, it might be sufficient to set a
robust minimum on the first allocation, ie use something along the lines
of Max(1024, MAXALIGN(state->main_data_len)).

			regards, tom lane
On 27 November 2017 at 04:46, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
>> On 26 November 2017 at 08:46, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I've confirmed that the attached is sufficient to stop the valgrind crash
>>> on my machine.  But as I said, I think we should be more aggressive at
>>> resizing the buffer, to reduce resize cycles.  I'm inclined to start out
>>> with a buffer size of 128 or 256 or so bytes and double it when needed.
>>> Anybody have a feeling for a typical size for the "main data" part
>>> of a WAL record?
>
>> We reuse the buffer and only pfree/palloc when we need to enlarge the
>> buffer, so not sure we need to do the doubling thing and it probably
>> doesn't matter what the typical size is.
>
> Well, I'm concerned about the possibility of a lot of palloc thrashing
> if the first bunch of records it reads happen to have steadily increasing
> sizes.  However, rather than doubling, it might be sufficient to set a
> robust minimum on the first allocation, ie use something along the lines
> of Max(1024, MAXALIGN(state->main_data_len)).

Agreed.

I was just researching what that number should be... and I was
thinking that we should use the maximum normal tuple size, which I
think is

   TOAST_TUPLE_THRESHOLD +
   SizeOfXLogRecord +
   SizeOfXLogRecordDataHeaderLong

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Simon Riggs <simon@2ndquadrant.com> writes:
> On 27 November 2017 at 04:46, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Well, I'm concerned about the possibility of a lot of palloc thrashing
>> if the first bunch of records it reads happen to have steadily increasing
>> sizes.  However, rather than doubling, it might be sufficient to set a
>> robust minimum on the first allocation, ie use something along the lines
>> of Max(1024, MAXALIGN(state->main_data_len)).

> Agreed.

> I was just researching what that number should be... and I was
> thinking that we should use the maximum normal tuple size, which I
> think is

> TOAST_TUPLE_THRESHOLD +
> SizeOfXLogRecord +
> SizeOfXLogRecordDataHeaderLong

Well, let's not overthink this, because anything under 8K is going to
be rounded up to the next power of 2 anyway by aset.c.  Based on this
point I'd say that BLCKSZ/2 or BLCKSZ/4 would be reasonable candidates
for the minimum.

			regards, tom lane
On 27 November 2017 at 05:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
>> On 27 November 2017 at 04:46, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Well, I'm concerned about the possibility of a lot of palloc thrashing
>>> if the first bunch of records it reads happen to have steadily increasing
>>> sizes.  However, rather than doubling, it might be sufficient to set a
>>> robust minimum on the first allocation, ie use something along the lines
>>> of Max(1024, MAXALIGN(state->main_data_len)).
>
>> Agreed.
>
>> I was just researching what that number should be... and I was
>> thinking that we should use the maximum normal tuple size, which I
>> think is
>
>> TOAST_TUPLE_THRESHOLD +
>> SizeOfXLogRecord +
>> SizeOfXLogRecordDataHeaderLong
>
> Well, let's not overthink this, because anything under 8K is going to
> be rounded up to the next power of 2 anyway by aset.c.  Based on this
> point I'd say that BLCKSZ/2 or BLCKSZ/4 would be reasonable candidates
> for the minimum.

BLCKSZ/2 seems best then.

I guess that means palloc doesn't work well with BLCKSZ > 8192

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Simon Riggs <simon@2ndquadrant.com> writes:
> On 27 November 2017 at 05:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Well, let's not overthink this, because anything under 8K is going to
>> be rounded up to the next power of 2 anyway by aset.c.  Based on this
>> point I'd say that BLCKSZ/2 or BLCKSZ/4 would be reasonable candidates
>> for the minimum.

> BLCKSZ/2 seems best then.

Sold, will make it so.

> I guess that means palloc doesn't work well with BLCKSZ > 8192

Don't think it's particularly related.

			regards, tom lane
On 27 November 2017 at 06:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
>> On 27 November 2017 at 05:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Well, let's not overthink this, because anything under 8K is going to
>>> be rounded up to the next power of 2 anyway by aset.c.  Based on this
>>> point I'd say that BLCKSZ/2 or BLCKSZ/4 would be reasonable candidates
>>> for the minimum.
>
>> BLCKSZ/2 seems best then.
>
> Sold, will make it so.

So you will like this next patch also, since there is related code
above that stanza.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Simon Riggs <simon@2ndquadrant.com> writes:
> On 27 November 2017 at 06:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Sold, will make it so.

> So you will like this next patch also, since there is related code
> above that stanza.

Looks reasonable to me, but I wonder whether BLCKSZ is a good initial
setting.  Since the data length is only uint16, we're apparently assuming
that the contents must be small.

			regards, tom lane
On 27 November 2017 at 09:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
>> On 27 November 2017 at 06:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Sold, will make it so.
>
>> So you will like this next patch also, since there is related code
>> above that stanza.
>
> Looks reasonable to me, but I wonder whether BLCKSZ is a good initial
> setting.  Since the data length is only uint16, we're apparently assuming
> that the contents must be small.

Our max block size is 2^15 so I think we're good.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services