Thread: atomic pin/unpin causing errors
I've bisected the errors I was seeing, discussed in
http://www.postgresql.org/message-id/CAMkU=1xQEhC0Ok4d+tkjFQ1nvUhO37PYRKhJP6Q8oxifMx7OwA@mail.gmail.com

It looks like they first appear in:

commit 48354581a49c30f5757c203415aa8412d85b0f70
Author: Andres Freund <andres@anarazel.de>
Date:   Sun Apr 10 20:12:32 2016 -0700

    Allow Pin/UnpinBuffer to operate in a lockfree manner.

I get the errors:

ERROR: attempted to delete invisible tuple
STATEMENT: update foo set count=count+1,text_array=$1 where text_array @> $2

And also:

ERROR: unexpected chunk number 1 (expected 2) for toast value
85223889 in pg_toast_16424
STATEMENT: update foo set count=count+1 where text_array @> $1

Once these errors start occurring, they happen often. Usually the "attempted to delete
invisible tuple" happens first.

These errors show up after about 9 hours of run time. The timing is predictable enough
that I don't think it is a purely stochastic race condition. It seems like some counter
variable is overflowing. But it is not the ShmemVariableCache->nextXid counter, as I
previously speculated: this test does not advance that counter fast enough for it to
wrap around within 9 hours of run time. But I am at a loss as to what other variable it
might be. Since the system goes through a crash and recovery every few seconds, any
backend-local counters or shared-memory counters would get reset upon recovery. Right?

I think the invisible tuple referred to might be a tuple in the toast table, not in the
parent table.

I don't see the problem with a cassert-enabled build, probably because it is just too
slow to ever reach the point where the problem occurs.

Any suggestions about where or how to look? I don't know if the "attempted to delete
invisible tuple" is the bug itself, or is just tripping over corruption left behind by
someone else.

(This was all run using Teodor's test-enabling patch gin_alone_cleanup-4.patch, so as
not to change horses in midstream. Now that a version of that patch has been committed,
I will try to repeat this in HEAD.)

Cheers,

Jeff
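For reference, the statements above imply roughly the following schema and workload
shape; the exact table definition and the crash-cycling harness are Jeff's and are not
shown in the thread, so treat this sketch as an assumption:

    -- Hypothetical reconstruction of the test table; names beyond "foo",
    -- "count" and "text_array" are assumptions.
    CREATE TABLE foo (
        text_array text[],   -- large arrays, so values are stored in pg_toast_16424
        count      integer
    );
    CREATE INDEX foo_text_array_idx ON foo USING gin (text_array);

    -- The workload repeatedly runs updates of this shape while the server is
    -- crashed and recovered every few seconds:
    --   UPDATE foo SET count = count + 1, text_array = $1 WHERE text_array @> $2;
    --   UPDATE foo SET count = count + 1 WHERE text_array @> $1;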
Hi,

On 2016-04-29 10:38:55 -0700, Jeff Janes wrote:
> I've bisected the errors I was seeing, discussed in
> http://www.postgresql.org/message-id/CAMkU=1xQEhC0Ok4d+tkjFQ1nvUhO37PYRKhJP6Q8oxifMx7OwA@mail.gmail.com
>
> It looks like they first appear in:
>
> commit 48354581a49c30f5757c203415aa8412d85b0f70
> Author: Andres Freund <andres@anarazel.de>
> Date:   Sun Apr 10 20:12:32 2016 -0700
>
>     Allow Pin/UnpinBuffer to operate in a lockfree manner.
>
> I get the errors:
>
> ERROR: attempted to delete invisible tuple
> STATEMENT: update foo set count=count+1,text_array=$1 where text_array @> $2
>
> And also:
>
> ERROR: unexpected chunk number 1 (expected 2) for toast value
> 85223889 in pg_toast_16424
> STATEMENT: update foo set count=count+1 where text_array @> $1
>
> Once these errors start occurring, they happen often. Usually the
> "attempted to delete invisible tuple" happens first.

That kind of seems to implicate clog/vacuuming, or something like that, being involved.

> These errors show up after about 9 hours of run time. The timing is
> predictable enough that I don't think it is a purely stochastic race
> condition.

Hm. I have a bit of a hard time believing that such a timing could be caused by the
above patch. How confident are you that it's that patch, and not just changed timing due
to performance changes? And you can definitely only reproduce the problem with the
regular crash cycles?

> It seems like some counter variable is overflowing. But it is not the
> ShmemVariableCache->nextXid counter, as I previously speculated. This test
> does not advance that fast enough for it to wrap around within 9 hours of
> run time. But I am at a loss as to what other variable it might be. Since
> the system goes through a crash and recovery every few seconds, any
> backend-local counters or shared-memory counters would get reset upon
> recovery. Right?

A lot of those counters will be re-set based on WAL contents. So if they're corrupted
once, several of them are prone to continue to be corrupted.

Greetings,

Andres Freund
Hi Jeff,

On 2016-04-29 10:38:55 -0700, Jeff Janes wrote:
> I've bisected the errors I was seeing, discussed in
> http://www.postgresql.org/message-id/CAMkU=1xQEhC0Ok4d+tkjFQ1nvUhO37PYRKhJP6Q8oxifMx7OwA@mail.gmail.com
>
> It looks like they first appear in:
>
> commit 48354581a49c30f5757c203415aa8412d85b0f70
> Author: Andres Freund <andres@anarazel.de>
> Date:   Sun Apr 10 20:12:32 2016 -0700
>
>     Allow Pin/UnpinBuffer to operate in a lockfree manner.
>
> I get the errors:
>
> ERROR: attempted to delete invisible tuple
> STATEMENT: update foo set count=count+1,text_array=$1 where text_array @> $2
>
> And also:
>
> ERROR: unexpected chunk number 1 (expected 2) for toast value
> 85223889 in pg_toast_16424
> STATEMENT: update foo set count=count+1 where text_array @> $1

Hm. I appear to have trouble reproducing this issue (continuing to try) on master as of
8826d8507. Is there any chance you could package up a data directory after the issue
hit?

> (This was all run using Teodor's test-enabling patch
> gin_alone_cleanup-4.patch, so as not to change horses in midstream.
> Now that a version of that patch has been committed, I will try to
> repeat this in HEAD.)

Any news on that front?

Greetings,

Andres Freund
On Wed, May 4, 2016 at 2:05 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-04-29 10:38:55 -0700, Jeff Janes wrote:
> I've bisected the errors I was seeing, discussed in
> http://www.postgresql.org/message-id/CAMkU=1xQEhC0Ok4d+tkjFQ1nvUhO37PYRKhJP6Q8oxifMx7OwA@mail.gmail.com
>
> It look like they first appear in:
>
> commit 48354581a49c30f5757c203415aa8412d85b0f70
> Author: Andres Freund <andres@anarazel.de>
> Date: Sun Apr 10 20:12:32 2016 -0700
>
> Allow Pin/UnpinBuffer to operate in a lockfree manner.
>
>
> I get the errors:
>
> ERROR: attempted to delete invisible tuple
> STATEMENT: update foo set count=count+1,text_array=$1 where text_array @> $2
>
> And also:
>
> ERROR: unexpected chunk number 1 (expected 2) for toast value
> 85223889 in pg_toast_16424
> STATEMENT: update foo set count=count+1 where text_array @> $1
Hm. I appear to have trouble reproducing this issue (continuing to try)
on master as of 8826d8507. Is there any chance you could package up a
data directory after the issue hit?
FWIW, I'm also trying to reproduce it on a big x86 machine on 9888b34f.
I'll write about the results when I get any.
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
>> I get the errors:
>>
>> ERROR: attempted to delete invisible tuple
>> STATEMENT: update foo set count=count+1,text_array=$1 where text_array @> $2
>>
>> And also:
>>
>> ERROR: unexpected chunk number 1 (expected 2) for toast value
>> 85223889 in pg_toast_16424
>> STATEMENT: update foo set count=count+1 where text_array @> $1
>
> Hm. I appear to have trouble reproducing this issue (continuing to try)
> on master as of 8826d8507. Is there any chance you could package up a
> data directory after the issue hit?

I've got

ERROR: unexpected chunk number 0 (expected 1) for toast value 10192986 in pg_toast_16424

The test required 10 hours to run on my notebook. postgresql was compiled with
-O0 --enable-debug --enable-cassert.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On 2016-05-04 18:12:45 +0300, Teodor Sigaev wrote:
> > > I get the errors:
> > >
> > > ERROR: attempted to delete invisible tuple
> > > STATEMENT: update foo set count=count+1,text_array=$1 where text_array @> $2
> > >
> > > And also:
> > >
> > > ERROR: unexpected chunk number 1 (expected 2) for toast value
> > > 85223889 in pg_toast_16424
> > > STATEMENT: update foo set count=count+1 where text_array @> $1
> >
> > Hm. I appear to have trouble reproducing this issue (continuing to try)
> > on master as of 8826d8507. Is there any chance you could package up a
> > data directory after the issue hit?
>
> I've got
> ERROR: unexpected chunk number 0 (expected 1) for toast value 10192986 in
> pg_toast_16424
>
> The test required 10 hours to run on my notebook. postgresql was compiled
> with -O0 --enable-debug --enable-cassert.

Interesting. I just ran a test for a good bit longer, till it failed due to a nearing
wraparound, presumably because the crashes are too frequent to finish vacuuming. I did
however, because Jeff said he couldn't reproduce with cassert, use an optimized build.

I wonder if there's some barrier-related issue, making this dependent on the compiler's
exact code generation. That'd explain why different people can reproduce it in different
circumstances.

Any chance you could package up that data directory for me to download?

Andres
On Tue, May 3, 2016 at 4:05 PM, Andres Freund <andres@anarazel.de> wrote:
> Hi Jeff,
>
> On 2016-04-29 10:38:55 -0700, Jeff Janes wrote:
>> I've bisected the errors I was seeing, discussed in
>> http://www.postgresql.org/message-id/CAMkU=1xQEhC0Ok4d+tkjFQ1nvUhO37PYRKhJP6Q8oxifMx7OwA@mail.gmail.com
>>
>> It looks like they first appear in:
>>
>> commit 48354581a49c30f5757c203415aa8412d85b0f70
>> Author: Andres Freund <andres@anarazel.de>
>> Date:   Sun Apr 10 20:12:32 2016 -0700
>>
>>     Allow Pin/UnpinBuffer to operate in a lockfree manner.
>>
>> I get the errors:
>>
>> ERROR: attempted to delete invisible tuple
>> STATEMENT: update foo set count=count+1,text_array=$1 where text_array @> $2
>>
>> And also:
>>
>> ERROR: unexpected chunk number 1 (expected 2) for toast value
>> 85223889 in pg_toast_16424
>> STATEMENT: update foo set count=count+1 where text_array @> $1
>
> Hm. I appear to have trouble reproducing this issue (continuing to try)
> on master as of 8826d8507. Is there any chance you could package up a
> data directory after the issue hit?

I'll look into it. I haven't been saving them, as they are very large (tens of GB) by
the time the errors happen. In case I can't find a way to transfer that much data, is
there something I could do in situ to debug it?

>> (This was all run using Teodor's test-enabling patch
>> gin_alone_cleanup-4.patch, so as not to change horses in midstream.
>> Now that a version of that patch has been committed, I will try to
>> repeat this in HEAD.)
>
> Any news on that front?

I couldn't reproduce it in 82881b2b432c9433b45a (which is what HEAD was at the time).
The last commit I saw the problem in was 8f1911d5e6d5, and in that commit it took longer
than usual to see the error; I never saw it at all in one run (which led me down the
wrong path in git bisect) but then saw errors upon another try. Up until that commit, it
seemed to give the errors like clockwork, always after 8 to 10 hours of running.

I also have never seen the errors with the crashing turned off. I even tried it with
autovac off and autovacuum_freeze_max_age=1500000000 (to emulate the way vacuum never
gets a chance to run to completion in the crashing mode), and then I don't get any
errors up to the point I run out of disk space.

Cheers,

Jeff
Hi Jeff,

On 2016-05-04 14:00:01 -0700, Jeff Janes wrote:
> On Tue, May 3, 2016 at 4:05 PM, Andres Freund <andres@anarazel.de> wrote:
> > Hm. I appear to have trouble reproducing this issue (continuing to try)
> > on master as of 8826d8507. Is there any chance you could package up a
> > data directory after the issue hit?
>
> I'll look into it. I haven't been saving them, as they are very large
> (tens of GB) by the time the errors happen.

Hm. Any chance that's SSH accessible?

What compiler version & OS are you using, with what exact CFLAGS/configure input? I'd
like to replicate the setup as closely as possible, in the hope of making it locally
reproducible.

> In case I can't find a way to transfer that much data, is there
> something I could do in situ to debug it?

Yes. It'd be good to get a look at the borked page/tuple with pageinspect. That might
require some manual searching to find the affected tuple, and possibly the problem is
transient.

I was wondering whether we could just put an Assert() into those error messages, to get
a stack dump. But unfortunately your tooling would likely generate far too many of
those.

Andres
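For anyone wanting to do the same inspection, queries along these lines could be used;
this is only a sketch, with the toast table name and chunk id taken from the error
messages above and the block number left as a placeholder:

    -- Find the physical location of the chunks for the failing toast value.
    SELECT ctid, chunk_id, chunk_seq, xmin, xmax
    FROM pg_toast.pg_toast_16424
    WHERE chunk_id = 85223889;

    -- Dump the raw item headers of the page containing a suspect chunk;
    -- replace 0 with the block number from the ctid above.
    CREATE EXTENSION IF NOT EXISTS pageinspect;
    SELECT lp, lp_flags, t_xmin, t_xmax, t_infomask, t_infomask2, t_ctid
    FROM heap_page_items(get_raw_page('pg_toast.pg_toast_16424', 0));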
> Any chance you could package up that data directory for me to download?

Sent by personal email to Alexander, Andres and Jeff.

In /var/log/messages I found:

May 4 22:04:07 xor kernel: pid 14010 (postgres), uid 1001: exited on signal 6 (core dumped)
May 4 22:04:25 xor kernel: pid 14032 (postgres), uid 1001: exited on signal 11 (core dumped)
May 4 22:04:52 xor kernel: pid 14037 (postgres), uid 1001: exited on signal 6 (core dumped)

Sometimes postgres crashes with a SIGSEGV signal instead of SIGABRT (which comes from
abort() in assert). I'll try to get a coredump after SIGSEGV, but it could take some
time.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
> I'll try to get a coredump after SIGSEGV, but it could take some time.

Got it!

#0  0x00000008014321d7 in sbrk () from /lib/libc.so.7
#1  0x0000000801431ddd in sbrk () from /lib/libc.so.7
#2  0x000000080142e5bb in sbrk () from /lib/libc.so.7
#3  0x000000080142e085 in sbrk () from /lib/libc.so.7
#4  0x000000080142de28 in sbrk () from /lib/libc.so.7
#5  0x000000080142e1cf in sbrk () from /lib/libc.so.7
#6  0x0000000801439815 in free () from /lib/libc.so.7
#7  0x000000080149e3d6 in nsdispatch () from /lib/libc.so.7
#8  0x00000008014a41c6 in __cxa_finalize () from /lib/libc.so.7
#9  0x000000080144525c in exit () from /lib/libc.so.7
#10 0x00000000008e1bc2 in quickdie (postgres_signal_arg=3) at postgres.c:2623
#11 <signal handler called>
#12 0x0000000801431847 in sbrk () from /lib/libc.so.7
#13 0x0000000801431522 in sbrk () from /lib/libc.so.7
#14 0x000000080142d47f in sbrk () from /lib/libc.so.7
#15 0x0000000801434628 in malloc () from /lib/libc.so.7
#16 0x0000000000aca278 in AllocSetAlloc (context=0x801c0bb88, size=24) at aset.c:853
#17 0x0000000000acca0e in MemoryContextAlloc (context=0x801c0bb88, size=24) at mcxt.c:764
#18 0x0000000000aebdb8 in PushActiveSnapshot (snap=0xf4ae10) at snapmgr.c:652
#19 0x00000000008e54bd in exec_bind_message (input_message=0x7fffffffdf60) at postgres.c:1602
#20 0x00000000008e3957 in PostgresMain (argc=1, argv=0x801d3c968, dbname=0x801d3c948 "teodor", username=0x801d3c928 "teodor") at postgres.c:4105
#21 0x0000000000839744 in BackendRun (port=0x801c991c0) at postmaster.c:4258
#22 0x0000000000838d54 in BackendStartup (port=0x801c991c0) at postmaster.c:3932
#23 0x0000000000835617 in ServerLoop () at postmaster.c:1690
#24 0x0000000000832c69 in PostmasterMain (argc=4, argv=0x7fffffffe420) at postmaster.c:1298
#25 0x000000000075f228 in main (argc=4, argv=0x7fffffffe420) at main.c:228

Seems we have some memory corruption, but it could be either a separate problem or the
same one.
Another one:

#0  0x00000008014321d7 in sbrk () from /lib/libc.so.7
#1  0x0000000801431ddd in sbrk () from /lib/libc.so.7
#2  0x000000080142e5bb in sbrk () from /lib/libc.so.7
#3  0x000000080142e085 in sbrk () from /lib/libc.so.7
#4  0x000000080142de28 in sbrk () from /lib/libc.so.7
#5  0x000000080142e1cf in sbrk () from /lib/libc.so.7
#6  0x0000000801439815 in free () from /lib/libc.so.7
#7  0x000000080149e3d6 in nsdispatch () from /lib/libc.so.7
#8  0x00000008014a41c6 in __cxa_finalize () from /lib/libc.so.7
#9  0x000000080144525c in exit () from /lib/libc.so.7
#10 0x00000000008e1bc2 in quickdie (postgres_signal_arg=3) at postgres.c:2623
#11 <signal handler called>
#12 0x000000080143277a in sbrk () from /lib/libc.so.7
#13 0x00000008014318b5 in sbrk () from /lib/libc.so.7
#14 0x000000080142e483 in sbrk () from /lib/libc.so.7
#15 0x000000080142e75b in sbrk () from /lib/libc.so.7
#16 0x00000008014398bd in free () from /lib/libc.so.7
#17 0x0000000000aca676 in AllocSetFree (context=0x801e710d0, pointer=0x801e65038) at aset.c:976
#18 0x0000000000acbe93 in pfree (pointer=0x801e65038) at mcxt.c:1015
#19 0x00000000004a7986 in ginendscan (scan=0x801e61de0) at ginscan.c:445
#20 0x0000000000504818 in index_endscan (scan=0x801e61de0) at indexam.c:339
#21 0x0000000000719d21 in ExecEndBitmapIndexScan (node=0x801e619c8) at nodeBitmapIndexscan.c:183
#22 0x00000000006fce9e in ExecEndNode (node=0x801e619c8) at execProcnode.c:685
#23 0x0000000000719195 in ExecEndBitmapHeapScan (node=0x801d63700) at nodeBitmapHeapscan.c:508
#24 0x00000000006fceaf in ExecEndNode (node=0x801d63700) at execProcnode.c:689
#25 0x000000000072b64a in ExecEndModifyTable (node=0x801d632a0) at nodeModifyTable.c:1978
#26 0x00000000006fcde3 in ExecEndNode (node=0x801d632a0) at execProcnode.c:638
#27 0x00000000006f6ed9 in ExecEndPlan (planstate=0x801d632a0, estate=0x801d63038) at execMain.c:1451
#28 0x00000000006f6e56 in standard_ExecutorEnd (queryDesc=0x801e42af0) at execMain.c:468
#29 0x00000008020038f2 in pgss_ExecutorEnd (queryDesc=0x801e42af0) at pg_stat_statements.c:938
#30 0x00000000006f6d3c in ExecutorEnd (queryDesc=0x801e42af0) at execMain.c:437
#31 0x00000000008ea387 in ProcessQuery (plan=0x801e43898, sourceText=0x801e42838 "update foo set count=count+1 where text_array @> $1", params=0x801e428b8, dest=0xf3fcc8, completionTag=0x7fffffffdd00 "UPDATE 1") at pquery.c:230
#32 0x00000000008e9540 in PortalRunMulti (portal=0x801dc5038, isTopLevel=1 '\001', dest=0xf3fcc8, altdest=0xf3fcc8, completionTag=0x7fffffffdd00 "UPDATE 1") at pquery.c:1267
#33 0x00000000008e8cd6 in PortalRun (portal=0x801dc5038, count=9223372036854775807, isTopLevel=1 '\001', dest=0x801c96450, altdest=0x801c96450, completionTag=0x7fffffffdd00 "UPDATE 1") at pquery.c:813
#34 0x00000000008e61ef in exec_execute_message (portal_name=0x801c96038 "", max_rows=9223372036854775807) at postgres.c:1979
#35 0x00000000008e39ae in PostgresMain (argc=1, argv=0x801d56bc8, dbname=0x801d56ba8 "teodor", username=0x801d56b88 "teodor") at postgres.c:4122
#36 0x0000000000839744 in BackendRun (port=0x801d571c0) at postmaster.c:4258
#37 0x0000000000838d54 in BackendStartup (port=0x801d571c0) at postmaster.c:3932
#38 0x0000000000835617 in ServerLoop () at postmaster.c:1690
#39 0x0000000000832c69 in PostmasterMain (argc=4, argv=0x7fffffffe420) at postmaster.c:1298
#40 0x000000000075f228 in main (argc=4, argv=0x7fffffffe420) at main.c:228

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Re: quickdie doing memory allocations (was atomic pin/unpin causing errors)
Hi Teodor,

Thanks for analyzing this.

On 2016-05-05 13:50:09 +0300, Teodor Sigaev wrote:
> > I'll try to get a coredump after SIGSEGV, but it could take some time.
>
> Got it!
>
> #0  0x00000008014321d7 in sbrk () from /lib/libc.so.7
> #1  0x0000000801431ddd in sbrk () from /lib/libc.so.7
> #2  0x000000080142e5bb in sbrk () from /lib/libc.so.7
> #3  0x000000080142e085 in sbrk () from /lib/libc.so.7
> #4  0x000000080142de28 in sbrk () from /lib/libc.so.7
> #5  0x000000080142e1cf in sbrk () from /lib/libc.so.7
> #6  0x0000000801439815 in free () from /lib/libc.so.7
> #7  0x000000080149e3d6 in nsdispatch () from /lib/libc.so.7
> #8  0x00000008014a41c6 in __cxa_finalize () from /lib/libc.so.7
> #9  0x000000080144525c in exit () from /lib/libc.so.7
> #10 0x00000000008e1bc2 in quickdie (postgres_signal_arg=3) at postgres.c:2623
> #11 <signal handler called>
> #12 0x0000000801431847 in sbrk () from /lib/libc.so.7
> #13 0x0000000801431522 in sbrk () from /lib/libc.so.7
> #14 0x000000080142d47f in sbrk () from /lib/libc.so.7
> #15 0x0000000801434628 in malloc () from /lib/libc.so.7
> #16 0x0000000000aca278 in AllocSetAlloc (context=0x801c0bb88, size=24) at aset.c:853
> #17 0x0000000000acca0e in MemoryContextAlloc (context=0x801c0bb88, size=24) at mcxt.c:764
> #18 0x0000000000aebdb8 in PushActiveSnapshot (snap=0xf4ae10) at snapmgr.c:652
> #19 0x00000000008e54bd in exec_bind_message (input_message=0x7fffffffdf60) at postgres.c:1602
> #20 0x00000000008e3957 in PostgresMain (argc=1, argv=0x801d3c968, dbname=0x801d3c948 "teodor", username=0x801d3c928 "teodor") at postgres.c:4105
> #21 0x0000000000839744 in BackendRun (port=0x801c991c0) at postmaster.c:4258
> #22 0x0000000000838d54 in BackendStartup (port=0x801c991c0) at postmaster.c:3932
> #23 0x0000000000835617 in ServerLoop () at postmaster.c:1690
> #24 0x0000000000832c69 in PostmasterMain (argc=4, argv=0x7fffffffe420) at postmaster.c:1298
> #25 0x000000000075f228 in main (argc=4, argv=0x7fffffffe420) at main.c:228
>
> Seems we have some memory corruption, but it could be either a separate
> problem or the same one.

That looks like an independent issue, namely that we're triggering memory allocations
from a signal handler (see frames 12, 11, 10, 9), presumably due to system-registered
atexit handlers. I suspect we should be using _exit() here? Tom?

Greetings,

Andres Freund
Hi Jeff,

On 2016-04-29 10:38:55 -0700, Jeff Janes wrote:
> I don't see the problem with a cassert-enabled build, probably because it
> is just too slow to ever reach the point where the problem occurs.

Running the test with cassert enabled I actually get assertion failures, due to the
FATAL you added.

#1  0x0000000000958dde in ExceptionalCondition (conditionName=0xb36c2a "!(RefCountErrors == 0)", errorType=0xb361af "FailedAssertion", fileName=0xb36170 "/home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c", lineNumber=2506) at /home/admin/src/postgresql/src/backend/utils/error/assert.c:54
#2  0x00000000007c9fc9 in CheckForBufferLeaks () at /home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2506
#3  0x00000000007c9f09 in AtProcExit_Buffers (code=1, arg=0) at /home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2459
#4  0x00000000007d927f in shmem_exit (code=1) at /home/admin/src/postgresql/src/backend/storage/ipc/ipc.c:261
#5  0x00000000007d90dd in proc_exit_prepare (code=1) at /home/admin/src/postgresql/src/backend/storage/ipc/ipc.c:185
#6  0x00000000007d904b in proc_exit (code=1) at /home/admin/src/postgresql/src/backend/storage/ipc/ipc.c:102
#7  0x000000000095958d in errfinish (dummy=0) at /home/admin/src/postgresql/src/backend/utils/error/elog.c:543
#8  0x000000000080214b in mdwrite (reln=0x2e8b4a8, forknum=MAIN_FORKNUM, blocknum=154, buffer=0x2e8e5a8 "", skipFsync=0 '\000') at /home/admin/src/postgresql/src/backend/storage/smgr/md.c:832
#9  0x0000000000804633 in smgrwrite (reln=0x2e8b4a8, forknum=MAIN_FORKNUM, blocknum=154, buffer=0x2e8e5a8 "", skipFsync=0 '\000') at /home/admin/src/postgresql/src/backend/storage/smgr/smgr.c:650
#10 0x00000000007ca548 in FlushBuffer (buf=0x7f0285955330, reln=0x2e8b4a8) at /home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2734
#11 0x00000000007c9d5a in SyncOneBuffer (buf_id=2503, skip_recently_used=0 '\000', wb_context=0x7ffe7305d290) at /home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2377
#12 0x00000000007c964e in BufferSync (flags=64) at /home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:1967
#13 0x00000000007ca185 in CheckPointBuffers (flags=64) at /home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2561
#14 0x000000000052d497 in CheckPointGuts (checkPointRedo=382762776, flags=64) at /home/admin/src/postgresql/src/backend/access/transam/xlog.c:8644
#15 0x000000000052cede in CreateCheckPoint (flags=64) at /home/admin/src/postgresql/src/backend/access/transam/xlog.c:8430
#16 0x00000000007706ac in CheckpointerMain () at /home/admin/src/postgresql/src/backend/postmaster/checkpointer.c:488
#17 0x000000000053e0d5 in AuxiliaryProcessMain (argc=2, argv=0x7ffe7305ea40) at /home/admin/src/postgresql/src/backend/bootstrap/bootstrap.c:429
#18 0x000000000078099f in StartChildProcess (type=CheckpointerProcess) at /home/admin/src/postgresql/src/backend/postmaster/postmaster.c:5227
#19 0x000000000077dcc3 in reaper (postgres_signal_arg=17) at /home/admin/src/postgresql/src/backend/postmaster/postmaster.c:2781
#20 <signal handler called>
#21 0x00007f028ebbdac3 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:81
#22 0x000000000077c049 in ServerLoop () at /home/admin/src/postgresql/src/backend/postmaster/postmaster.c:1654
#23 0x000000000077b7a9 in PostmasterMain (argc=4, argv=0x2e49f20) at /home/admin/src/postgresql/src/backend/postmaster/postmaster.c:1298
#24 0x00000000006c5849 in main (argc=4, argv=0x2e49f20) at /home/admin/src/postgresql/src/backend/main/main.c:228

You didn't see those?
The trigger here appears to be that the checkpointer doesn't have an on-exit callback
similar to a normal backend's ShutdownPostgres() et al, and thus doesn't trigger a
resource owner release. The normal ERROR path has

    /* buffer pins are released here: */
    ResourceOwnerRelease(CurrentResourceOwner,
                         RESOURCE_RELEASE_BEFORE_LOCKS,
                         false, true);
    /* we needn't bother with the other ResourceOwnerRelease phases */

That clearly is a bug. But I'm not immediately seeing how this could trigger the
corruption issue you observed.

Greetings,

Andres Freund
On 2016-05-04 18:12:45 +0300, Teodor Sigaev wrote:
> > > I get the errors:
> > >
> > > ERROR: attempted to delete invisible tuple
> > > STATEMENT: update foo set count=count+1,text_array=$1 where text_array @> $2
> > >
> > > And also:
> > >
> > > ERROR: unexpected chunk number 1 (expected 2) for toast value
> > > 85223889 in pg_toast_16424
> > > STATEMENT: update foo set count=count+1 where text_array @> $1
> >
> > Hm. I appear to have trouble reproducing this issue (continuing to try)
> > on master as of 8826d8507. Is there any chance you could package up a
> > data directory after the issue hit?
>
> I've got
> ERROR: unexpected chunk number 0 (expected 1) for toast value 10192986 in
> pg_toast_16424
>
> The test required 10 hours to run on my notebook. postgresql was compiled
> with -O0 --enable-debug --enable-cassert.

Hm. And you're not seeing the asserts I reported in
http://archives.postgresql.org/message-id/20160505185246.2i7qftadwhzewykj%40alap3.anarazel.de ?

Greetings,

Andres Freund
> Hm. And you're not seeing the asserts I reported in
> http://archives.postgresql.org/message-id/20160505185246.2i7qftadwhzewykj%40alap3.anarazel.de
> ?

I see it a lot, but I think that is a result of ereport(FATAL) after FileWrite(BLCKSZ/3)
added by Jeff.

Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Andres Freund <andres@anarazel.de> writes:
>> #0  0x00000008014321d7 in sbrk () from /lib/libc.so.7
>> #1  0x0000000801431ddd in sbrk () from /lib/libc.so.7
>> #2  0x000000080142e5bb in sbrk () from /lib/libc.so.7
>> #3  0x000000080142e085 in sbrk () from /lib/libc.so.7
>> #4  0x000000080142de28 in sbrk () from /lib/libc.so.7
>> #5  0x000000080142e1cf in sbrk () from /lib/libc.so.7
>> #6  0x0000000801439815 in free () from /lib/libc.so.7
>> #7  0x000000080149e3d6 in nsdispatch () from /lib/libc.so.7
>> #8  0x00000008014a41c6 in __cxa_finalize () from /lib/libc.so.7
>> #9  0x000000080144525c in exit () from /lib/libc.so.7
>> #10 0x00000000008e1bc2 in quickdie (postgres_signal_arg=3) at postgres.c:2623
>> #11 <signal handler called>
>> #12 0x0000000801431847 in sbrk () from /lib/libc.so.7

> That looks like an independent issue, namely that we're triggering memory
> allocations from a signal handler (see frames 12, 11, 10, 9), presumably
> due to system-registered atexit handlers. I suspect we should be using
> _exit() here? Tom?

I don't think that would improve matters. In the first place, if we use _exit() here
that might encourage third-party extension authors to believe they should use _exit(),
which would be bad. In the second place, we don't know what it is we're skipping by not
running atexit handlers, and again that could be bad. We don't like people trying to
bypass our on-exit code, why would anyone else? In the third place, by the time we get
to the exit() call we've already exposed ourselves to a whole lot of such hazards by
running ereport() (including sending a message to the client!). In the fourth place, if
we've received a quickdie interrupt, it doesn't actually matter if the process crashes;
we just want it to quit ASAP.

So I'd say that this is just a cosmetic problem and that trying to fix it is likely to
make things worse.

regards, tom lane
Re: quickdie doing memory allocations (was atomic pin/unpin causing errors)
Hi,

On 2016-05-05 15:56:45 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> >> #0  0x00000008014321d7 in sbrk () from /lib/libc.so.7
> >> #1  0x0000000801431ddd in sbrk () from /lib/libc.so.7
> >> #2  0x000000080142e5bb in sbrk () from /lib/libc.so.7
> >> #3  0x000000080142e085 in sbrk () from /lib/libc.so.7
> >> #4  0x000000080142de28 in sbrk () from /lib/libc.so.7
> >> #5  0x000000080142e1cf in sbrk () from /lib/libc.so.7
> >> #6  0x0000000801439815 in free () from /lib/libc.so.7
> >> #7  0x000000080149e3d6 in nsdispatch () from /lib/libc.so.7
> >> #8  0x00000008014a41c6 in __cxa_finalize () from /lib/libc.so.7
> >> #9  0x000000080144525c in exit () from /lib/libc.so.7
> >> #10 0x00000000008e1bc2 in quickdie (postgres_signal_arg=3) at postgres.c:2623
> >> #11 <signal handler called>
> >> #12 0x0000000801431847 in sbrk () from /lib/libc.so.7
>
> > That looks like an independent issue, namely that we're triggering memory
> > allocations from a signal handler (see frames 12, 11, 10, 9), presumably
> > due to system-registered atexit handlers. I suspect we should be using
> > _exit() here? Tom?
>
> I don't think that would improve matters. In the first place, if we use
> _exit() here that might encourage third-party extension authors to believe
> they should use _exit(), which would be bad.

The source tree already has a number of _exit() calls, so I don't think that'd make a
meaningful difference.

> In the second place, we don't know what it is we're skipping by not
> running atexit handlers, and again that could be bad.

I have a hard time coming up with a scenario where that'd be a problem in a PANIC case.
Isn't it pretty common to use _exit after fatal errors (and forks)?

> In the third place, by the time we get to the exit() call we've already
> exposed ourselves to a whole lot of such hazards by running ereport()
> (including sending a message to the client!).

True. And that's not good. But the magic of ErrorContext shields us from a fair amount
of issues.

> In the fourth place, if we've received a quickdie interrupt, it doesn't
> actually matter if the process crashes; we just want it to quit ASAP.

If it always were crashing, that'd be somewhat fine. But sbrk internally uses mutexes,
so this can result in processes getting stuck. And that is a problem. There've actually
been reports about that every now and then.

Greetings,

Andres Freund
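To make the hazard under discussion concrete, here is a minimal standalone sketch (not
PostgreSQL code; the handler name is made up) of why exit() from a signal handler can
deadlock while _exit() cannot: exit() runs atexit()/finalizer callbacks, which may
re-enter the allocator that the signal interrupted, whereas _exit() terminates the
process immediately.

    #include <signal.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Hypothetical PANIC-style handler; the real quickdie() also ereport()s first. */
    static void quickdie_sketch(int signo)
    {
        /*
         * exit(2) would run atexit() handlers registered by libc or by libraries,
         * and those may call free()/malloc().  If this signal arrived while the
         * interrupted code was inside malloc(), that re-entry can block forever
         * on the allocator's internal lock (the "stuck process" case above).
         */
        _exit(2);               /* terminate without running atexit handlers */
    }

    int main(void)
    {
        signal(SIGQUIT, quickdie_sketch);
        for (;;)
            free(malloc(64));   /* allocator activity that the signal may interrupt */
    }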
On Thu, May 5, 2016 at 11:51 AM, Andres Freund <andres@anarazel.de> wrote:
>> #7  0x000000080149e3d6 in nsdispatch () from /lib/libc.so.7
>> #8  0x00000008014a41c6 in __cxa_finalize () from /lib/libc.so.7
>> #9  0x000000080144525c in exit () from /lib/libc.so.7
>> #10 0x00000000008e1bc2 in quickdie (postgres_signal_arg=3) at postgres.c:2623

Eh, doesn't this __cxa_finalize() stuff suggest that some C++ code was linked into the
backend?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: quickdie doing memory allocations (was atomic pin/unpin causing errors)
On 2016-05-05 16:32:38 -0400, Robert Haas wrote:
> On Thu, May 5, 2016 at 11:51 AM, Andres Freund <andres@anarazel.de> wrote:
> >> #7  0x000000080149e3d6 in nsdispatch () from /lib/libc.so.7
> >> #8  0x00000008014a41c6 in __cxa_finalize () from /lib/libc.so.7
> >> #9  0x000000080144525c in exit () from /lib/libc.so.7
> >> #10 0x00000000008e1bc2 in quickdie (postgres_signal_arg=3) at postgres.c:2623
>
> Eh, doesn't this __cxa_finalize() stuff suggest that some C++ code was
> linked into the backend?

IIRC __cxa_finalize also handles atexit() (and gcc __attribute__((destructor))).

Greetings,

Andres Freund
On 2016-05-05 11:52:46 -0700, Andres Freund wrote:
> Hi Jeff,
>
> On 2016-04-29 10:38:55 -0700, Jeff Janes wrote:
> > I don't see the problem with a cassert-enabled build, probably because it
> > is just too slow to ever reach the point where the problem occurs.
>
> Running the test with cassert enabled I actually get assertion failures,
> due to the FATAL you added.
>
> #1  0x0000000000958dde in ExceptionalCondition (conditionName=0xb36c2a "!(RefCountErrors == 0)", errorType=0xb361af "FailedAssertion", fileName=0xb36170 "/home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c", lineNumber=2506) at /home/admin/src/postgresql/src/backend/utils/error/assert.c:54
> #2  0x00000000007c9fc9 in CheckForBufferLeaks () at /home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2506
> #3  0x00000000007c9f09 in AtProcExit_Buffers (code=1, arg=0) at /home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2459
> #4  0x00000000007d927f in shmem_exit (code=1) at /home/admin/src/postgresql/src/backend/storage/ipc/ipc.c:261
> #5  0x00000000007d90dd in proc_exit_prepare (code=1) at /home/admin/src/postgresql/src/backend/storage/ipc/ipc.c:185
> #6  0x00000000007d904b in proc_exit (code=1) at /home/admin/src/postgresql/src/backend/storage/ipc/ipc.c:102
> #7  0x000000000095958d in errfinish (dummy=0) at /home/admin/src/postgresql/src/backend/utils/error/elog.c:543
> #8  0x000000000080214b in mdwrite (reln=0x2e8b4a8, forknum=MAIN_FORKNUM, blocknum=154, buffer=0x2e8e5a8 "", skipFsync=0 '\000') at /home/admin/src/postgresql/src/backend/storage/smgr/md.c:832
> #9  0x0000000000804633 in smgrwrite (reln=0x2e8b4a8, forknum=MAIN_FORKNUM, blocknum=154, buffer=0x2e8e5a8 "", skipFsync=0 '\000') at /home/admin/src/postgresql/src/backend/storage/smgr/smgr.c:650
> #10 0x00000000007ca548 in FlushBuffer (buf=0x7f0285955330, reln=0x2e8b4a8) at /home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2734
> #11 0x00000000007c9d5a in SyncOneBuffer (buf_id=2503, skip_recently_used=0 '\000', wb_context=0x7ffe7305d290) at /home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2377
> #12 0x00000000007c964e in BufferSync (flags=64) at /home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:1967
> #13 0x00000000007ca185 in CheckPointBuffers (flags=64) at /home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2561
> #14 0x000000000052d497 in CheckPointGuts (checkPointRedo=382762776, flags=64) at /home/admin/src/postgresql/src/backend/access/transam/xlog.c:8644
> #15 0x000000000052cede in CreateCheckPoint (flags=64) at /home/admin/src/postgresql/src/backend/access/transam/xlog.c:8430
> #16 0x00000000007706ac in CheckpointerMain () at /home/admin/src/postgresql/src/backend/postmaster/checkpointer.c:488
> #17 0x000000000053e0d5 in AuxiliaryProcessMain (argc=2, argv=0x7ffe7305ea40) at /home/admin/src/postgresql/src/backend/bootstrap/bootstrap.c:429
> #18 0x000000000078099f in StartChildProcess (type=CheckpointerProcess) at /home/admin/src/postgresql/src/backend/postmaster/postmaster.c:5227
> #19 0x000000000077dcc3 in reaper (postgres_signal_arg=17) at /home/admin/src/postgresql/src/backend/postmaster/postmaster.c:2781
> #20 <signal handler called>
> #21 0x00007f028ebbdac3 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:81
> #22 0x000000000077c049 in ServerLoop () at /home/admin/src/postgresql/src/backend/postmaster/postmaster.c:1654
> #23 0x000000000077b7a9 in PostmasterMain (argc=4, argv=0x2e49f20) at /home/admin/src/postgresql/src/backend/postmaster/postmaster.c:1298
> #24 0x00000000006c5849 in main (argc=4, argv=0x2e49f20) at /home/admin/src/postgresql/src/backend/main/main.c:228
>
> You didn't see those?
>
> The trigger here appears to be that the checkpointer doesn't have an
> on-exit callback similar to a normal backend's ShutdownPostgres() et al,
> and thus doesn't trigger a resource owner release. The normal ERROR
> path has
>
>     /* buffer pins are released here: */
>     ResourceOwnerRelease(CurrentResourceOwner,
>                          RESOURCE_RELEASE_BEFORE_LOCKS,
>                          false, true);
>     /* we needn't bother with the other ResourceOwnerRelease phases */
>
> That clearly is a bug. But I'm not immediately seeing how this could
> trigger the corruption issue you observed.

The same issue exists in bgwriter afaics. ISTM that we need to provide a
before_shmem_exit (or on_shmem_exit?) handler for both, which essentially does

    /*
     * These operations are really just a minimal subset of
     * AbortTransaction(). We don't have very many resources to worry
     * about in bgwriter, but we do have LWLocks, buffers, and temp files.
     */
    LWLockReleaseAll();
    AbortBufferIO();
    UnlockBuffers();
    /* buffer pins are released here: */
    ResourceOwnerRelease(CurrentResourceOwner,
                         RESOURCE_RELEASE_BEFORE_LOCKS,
                         false, true);

It looks to me like that should be backpatched? There's some question about how to make
the ordering vs. AtProcExit_Buffers robust; which is why I'm explicitly doing
LWLockReleaseAll/AbortBufferIO/UnlockBuffers above. Any better ideas?
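For illustration, a sketch of how such a callback might be registered and shaped; the
function name and the registration point are assumptions, and the body simply restates
the operations listed above:

    /*
     * Hypothetical cleanup callback for the checkpointer (and bgwriter);
     * the name and where exactly it gets registered are assumptions.
     */
    static void
    CheckpointerReleaseResourcesAtExit(int code, Datum arg)
    {
        LWLockReleaseAll();
        AbortBufferIO();
        UnlockBuffers();
        /* buffer pins are released here: */
        ResourceOwnerRelease(CurrentResourceOwner,
                             RESOURCE_RELEASE_BEFORE_LOCKS,
                             false, true);
    }

    /* ... and, early in CheckpointerMain(), once CurrentResourceOwner exists: */
    before_shmem_exit(CheckpointerReleaseResourcesAtExit, (Datum) 0);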
On Thu, May 5, 2016 at 11:52 AM, Andres Freund <andres@anarazel.de> wrote:
> Hi Jeff,
>
> On 2016-04-29 10:38:55 -0700, Jeff Janes wrote:
>> I don't see the problem with a cassert-enabled build, probably because it
>> is just too slow to ever reach the point where the problem occurs.
>
> Running the test with cassert enabled I actually get assertion failures,
> due to the FATAL you added.
>
> #1  0x0000000000958dde in ExceptionalCondition (conditionName=0xb36c2a "!(RefCountErrors == 0)", errorType=0xb361af "FailedAssertion", fileName=0xb36170 "/home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c", lineNumber=2506) at /home/admin/src/postgresql/src/backend/utils/error/assert.c:54
> #2  0x00000000007c9fc9 in CheckForBufferLeaks () at /home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2506
...
>
> You didn't see those?

Yes, I have been seeing those on assert-enabled builds going back as far as I can
remember (long before this particular problem started showing up). I just assumed it was
a natural consequence of throwing an ERROR from inside a critical section. I never
really understood it: why would a panicking process bother to check for buffer leaks in
the first place? It is leaking everything, which is why the entire system has to be
brought down immediately.

I have been trying (and failing) to reproduce the problem in more recent releases, with
and without cassert. Here is pg_config output of one of my current attempts:

BINDIR = /home/jjanes/pgsql/torn_bisect/bin
DOCDIR = /home/jjanes/pgsql/torn_bisect/share/doc
HTMLDIR = /home/jjanes/pgsql/torn_bisect/share/doc
INCLUDEDIR = /home/jjanes/pgsql/torn_bisect/include
PKGINCLUDEDIR = /home/jjanes/pgsql/torn_bisect/include
INCLUDEDIR-SERVER = /home/jjanes/pgsql/torn_bisect/include/server
LIBDIR = /home/jjanes/pgsql/torn_bisect/lib
PKGLIBDIR = /home/jjanes/pgsql/torn_bisect/lib
LOCALEDIR = /home/jjanes/pgsql/torn_bisect/share/locale
MANDIR = /home/jjanes/pgsql/torn_bisect/share/man
SHAREDIR = /home/jjanes/pgsql/torn_bisect/share
SYSCONFDIR = /home/jjanes/pgsql/torn_bisect/etc
PGXS = /home/jjanes/pgsql/torn_bisect/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = 'CFLAGS=-ggdb' '--with-extra-version=-c1543a8' '--enable-debug' '--with-libxml' '--with-perl' '--with-python' '--with-ldap' '--with-openssl' '--with-gssapi' '--enable-cassert' '--prefix=/home/jjanes/pgsql/torn_bisect/'
CC = gcc
CPPFLAGS = -DFRONTEND -D_GNU_SOURCE -I/usr/include/libxml2
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -g -ggdb
CFLAGS_SL = -fpic
LDFLAGS = -L../../src/common -Wl,--as-needed -Wl,-rpath,'/home/jjanes/pgsql/torn_bisect/lib',--enable-new-dtags
LDFLAGS_EX =
LDFLAGS_SL =
LIBS = -lpgcommon -lpgport -lxml2 -lssl -lcrypto -lgssapi_krb5 -lz -lreadline -lrt -lcrypt -ldl -lm
VERSION = PostgreSQL 9.6devel-c1543a8

The only difference between this and the ones that did find the ERROR would be toggling
--enable-cassert and changing which git commit was used (and manually applying the
gin_alone patch when testing commits that precede that one's committal).
Linux: 2.6.32-573.22.1.el6.x86_64 #1 SMP Wed Mar 23 03:35:39 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

/proc/cpuinfo:

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 62
model name      : Intel(R) Xeon(R) CPU E5-4640 v2 @ 2.20GHz
stepping        : 4
microcode       : 4294967295
cpu MHz         : 2199.933
cache size      : 20480 KB
physical id     : 0
siblings        : 8
core id         : 7
cpu cores       : 8
apicid          : 7
initial apicid  : 7
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good unfair_spinlock pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c rdrand hypervisor lahf_lm xsaveopt fsgsbase smep erms
bogomips        : 4399.86
clflush size    : 64
cache_alignment : 64
address sizes   : 42 bits physical, 48 bits virtual
power management:

Cheers,

Jeff
On 2016-05-06 11:15:03 -0700, Jeff Janes wrote:
> > Running the test with cassert enabled I actually get assertion failures,
> > due to the FATAL you added.
> >
> > #1  0x0000000000958dde in ExceptionalCondition (conditionName=0xb36c2a "!(RefCountErrors == 0)", errorType=0xb361af "FailedAssertion", fileName=0xb36170 "/home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c", lineNumber=2506) at /home/admin/src/postgresql/src/backend/utils/error/assert.c:54
> > #2  0x00000000007c9fc9 in CheckForBufferLeaks () at /home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2506
> ...
> >
> > You didn't see those?
>
> Yes, I have been seeing those on assert-enabled builds going back as
> far as I can remember (long before this particular problem started
> showing up). I just assumed it was a natural consequence of throwing
> an ERROR from inside a critical section.

There's no critical section here... The reason we're restarting is because the FATAL
makes the checkpointer exit, which the postmaster treats as a cause to trigger a PANIC:

    /*
     * Any unexpected exit of the checkpointer (including FATAL
     * exit) is treated as a crash.
     */
    HandleChildCrash(pid, exitstatus,
                     _("checkpointer process"));

> I never really understood it: why would a panicking process bother to
> check for buffer leaks in the first place? It is leaking everything,
> which is why the entire system has to be brought down immediately.

Panicking ones don't...

> I have been trying (and failing) to reproduce the problem in more
> recent releases, with and without cassert. Here is pg_config output
> of one of my current attempts:

By "recent releases" you mean that you've not been able to reproduce it in 9.5, 9.4,
..., not that the issue has vanished in master, right?

Greetings,

Andres Freund
On Fri, May 6, 2016 at 11:24 AM, Andres Freund <andres@anarazel.de> wrote:
>
>> I have been trying (and failing) to reproduce the problem in more
>> recent releases, with and without cassert. Here is pg_config output
>> of one of my current attempts:
>
> By "recent releases" you mean that you've not been able to reproduce it
> in 9.5, 9.4, ..., not that the issue has vanished in master, right?

Sorry, I meant recent commits, not releases.

I have not been able to reproduce it in master (anything after 8f1911d5e6), but I also
can't find the commit which fixed it, because I don't know how many hours of running
without an error is enough to declare it good. I've already gotten burned by that, by
declaring 8f1911d5e6 good once only to find out it was still bad, just with a lower
probability of finding it.

I also can't reproduce it in 9.5, but I wouldn't expect to, given the large changes in
the way GIN indexes operate between 9.5 and 9.6, which completely change the timing and
types of races (even ones outside of the GIN code) one might expect to see. I'm pretty
sure I would have caught it during 9.5's beta if it were findable with this test, as
this test was run back then in pretty close to its current form. I'm not saying there
isn't a bug in 9.5, just that if there is, this test is not efficient at invoking it.

So the issue *has* vanished in master, but without knowing what fixed it I have no
confidence it was actually fixed rather than the test just having stopped being
effective.

What I plan on doing now is going back to the part of the commit history where I could
reproduce it reliably, and start throwing unnecessary things out of the ./configure and
the postgresql.conf to see what the minimal configuration is at which I can reproduce
it.

Sorry for the confusion.

Cheers,

Jeff
Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)
Hi,

On 2016-04-29 10:38:55 -0700, Jeff Janes wrote:
> I've bisected the errors I was seeing, discussed in
> http://www.postgresql.org/message-id/CAMkU=1xQEhC0Ok4d+tkjFQ1nvUhO37PYRKhJP6Q8oxifMx7OwA@mail.gmail.com
>
> It looks like they first appear in:
>
> commit 48354581a49c30f5757c203415aa8412d85b0f70
> Author: Andres Freund <andres@anarazel.de>
> Date:   Sun Apr 10 20:12:32 2016 -0700
>
>     Allow Pin/UnpinBuffer to operate in a lockfree manner.
>
> I get the errors:
>
> ERROR: attempted to delete invisible tuple
> STATEMENT: update foo set count=count+1,text_array=$1 where text_array @> $2
>
> And also:
>
> ERROR: unexpected chunk number 1 (expected 2) for toast value
> 85223889 in pg_toast_16424
> STATEMENT: update foo set count=count+1 where text_array @> $1

I'm a bit dumbfounded here. I think I found the issue, and, if so, it's *very*
longstanding. HeapTupleSatisfiesToast(), aborted transactions, and hint bits (or oid
wraparound) appear to be a broken combination.

What I'm seeing is that tuptoaster.c is trying to fetch or delete toast chunks generated
in an aborted (via crash) transaction, and that, with the same valueid/chunk_id, a valid
value exists somewhere else in the toast table.

HeapTupleSatisfiesToast() does not do any actual tuple liveliness detection besides
checking basic hint bits. Thus, if there's a dead tuple in the toast table's heap, with
a *live* reference from the table, toast_fetch_datum() potentially returns the wrong
results.

That leads to the next question: how can there be a live reference to a dead and reused
toast id? As far as I can see the primary problem is that GetNewOidWithIndex() uses
SnapshotDirty to verify whether a tuple is live. As that actually verifies tuple
liveliness (and sets hint bits), it'll skip over an aborted toasted datum.

A valid question is why the hint bits set by HeapTupleSatisfiesDirty() don't prevent
that from occurring (besides standbys, which aren't involved): it's the constant
crashes. Looking at a tuple which triggers the "unexpected chunk number" error, it
indeed has a live and a dead version of the same chunk id. Interestingly the dead rows
look like:

chunk_id = 1 - t_infomask: 0xA02 - HEAP_XMAX_INVALID | HEAP_XMIN_INVALID | HEAP_HASVARWIDTH
chunk_id = 2 - t_infomask: 0x802 - HEAP_XMAX_INVALID | HEAP_HASVARWIDTH

My rather strong suspicion on why that is, is that both tuples were visited by
SnapshotDirty(), but only the first one's hint bits got persisted, because checksums
were enabled. The second time through, MarkBufferDirtyHint skipped work, because the
buffer was already dirty.

rmgr: XLOG len (rec/tot): 0/ 3809, tx: 206127725, lsn: 1C/EB3B0910, prev 1C/EB3B08E0, desc: FPI_FOR_HINT, blkref #0: rel 1663/16384/16433 blk 202567 FPW

That's indeed the only time the affected page is touched, as visible to pg_xlogdump,
after being filled...

The easy way to trigger this problem would be to have an oid wraparound, but the WAL
shows that that's not the case here. I've not figured that one out entirely (and won't
tonight). But I do see WAL records like:

rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID 4302693
rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID 4302693

i.e. two NEXTOID records allocating the same range, which obviously doesn't seem right.
There are also every now and then close-by ranges:

rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 1/9A404DB8, prev 1/9A404270, desc: NEXTOID 3311455
rmgr: XLOG len (rec/tot): 4/ 30, tx: 7814505, lsn: 1/9A4EC888, prev 1/9A4EB9D0, desc: NEXTOID 3311461

As far as I can see, something like the above, or an oid wraparound, is pretty much
deadly for toast.

Is anybody ready with a good defense for SatisfiesToast not doing any actual liveliness
checks?

Andres
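For anyone repeating this inspection, the infomask values above can be decoded directly
from a pageinspect dump; this is only a sketch, with the flag constants taken from
src/include/access/htup_details.h and assuming that rel 16433 / blk 202567 from the
xlogdump record above is the affected toast table page:

    -- HEAP_HASVARWIDTH = 0x0002, HEAP_XMIN_INVALID = 0x0200, HEAP_XMAX_INVALID = 0x0800,
    -- so 0x0A02 = XMAX_INVALID | XMIN_INVALID | HASVARWIDTH, and 0x0802 lacks XMIN_INVALID.
    SELECT lp, t_xmin, t_xmax, to_hex(t_infomask) AS infomask,
           (t_infomask & 512)  <> 0 AS xmin_invalid,
           (t_infomask & 2048) <> 0 AS xmax_invalid
    FROM heap_page_items(get_raw_page('pg_toast.pg_toast_16424', 202567));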
On 10 May 2016 at 09:05, Andres Freund <andres@anarazel.de> wrote:
> Is anybody ready with a good defense for SatisfiesToast not doing any
> actual liveliness checks?
I provided a patch earlier that rechecks the OID fetched from a toast chunk matches the OID requested.
I didn't commit it, I just used it to check the patch which changed btree vacuum replay.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, May 10, 2016 at 3:05 AM, Andres Freund <andres@anarazel.de> wrote:
> The easy way to trigger this problem would be to have an oid wraparound,
> but the WAL shows that that's not the case here. I've not figured that
> one out entirely (and won't tonight). But I do see WAL records like:
>
> rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID 4302693
> rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID 4302693
>
> i.e. two NEXTOID records allocating the same range, which obviously
> doesn't seem right. There are also every now and then close-by ranges:
>
> rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 1/9A404DB8, prev 1/9A404270, desc: NEXTOID 3311455
> rmgr: XLOG len (rec/tot): 4/ 30, tx: 7814505, lsn: 1/9A4EC888, prev 1/9A4EB9D0, desc: NEXTOID 3311461
>
> As far as I can see, something like the above, or an oid wraparound, is
> pretty much deadly for toast.
>
> Is anybody ready with a good defense for SatisfiesToast not doing any
> actual liveliness checks?

I assume that this was installed as a performance optimization, and I don't really see
why it shouldn't be safe or be able to be made safe. I assume that the wraparound case
was deemed safe because, at the time, the idea of 4 billion OIDs getting used with old
transactions still active seemed inconceivable.

It seems to me that the real question here is how you're getting two calls to
XLogPutNextOid() with the same value of ShmemVariableCache->nextOid, and the answer, as
it seems to me, must be that LWLocks are broken. Either two processes are managing to
hold OidGenLock in exclusive mode at the same time, or they're acquiring it in quick
succession but without the second process seeing all of the updates performed by the
first process.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
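For readers following along, a condensed paraphrase of the allocation path in question
(based on src/backend/access/transam/varsup.c of that era, simplified, with wraparound
handling elided; treat the details as approximate): the NEXTOID record is only written
while OidGenLock is held exclusively, which is why duplicate records point at locking or
memory-ordering trouble.

    /* Simplified sketch of GetNewObjectId(); not the verbatim source. */
    Oid
    GetNewObjectId(void)
    {
        Oid result;

        LWLockAcquire(OidGenLock, LW_EXCLUSIVE);

        /* wraparound handling of nextOid elided */

        if (ShmemVariableCache->oidCount == 0)
        {
            /* log the upper bound of the next prefetched OID range */
            XLogPutNextOid(ShmemVariableCache->nextOid + VAR_OID_PREFETCH);
            ShmemVariableCache->oidCount = VAR_OID_PREFETCH;
        }

        result = ShmemVariableCache->nextOid;
        (ShmemVariableCache->nextOid)++;
        (ShmemVariableCache->oidCount)--;

        LWLockRelease(OidGenLock);
        return result;
    }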
Andres Freund <andres@anarazel.de> writes:
> Is anybody ready with a good defense for SatisfiesToast not doing any
> actual liveliness checks?

As long as we do not update toast values after creation, there is no need; the liveness
check on the parent tuple is what's important. Adding a liveness check on the toast item
would merely create new failure modes with no corresponding benefit. Imagine deciding
that field 3 of a regular tuple was somehow dead even though the rest of the tuple is
live --- how can that be good?

I concur with Robert that what this looks like is failure to ensure that toast OIDs are
unique, which is an entirely different problem.

regards, tom lane
Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)
On 2016-05-10 08:09:02 -0400, Robert Haas wrote:
> On Tue, May 10, 2016 at 3:05 AM, Andres Freund <andres@anarazel.de> wrote:
> > The easy way to trigger this problem would be to have an oid wraparound,
> > but the WAL shows that that's not the case here. I've not figured that
> > one out entirely (and won't tonight). But I do see WAL records like:
> >
> > rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID 4302693
> > rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID 4302693
> >
> > i.e. two NEXTOID records allocating the same range, which obviously
> > doesn't seem right. There are also every now and then close-by ranges:
> >
> > rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 1/9A404DB8, prev 1/9A404270, desc: NEXTOID 3311455
> > rmgr: XLOG len (rec/tot): 4/ 30, tx: 7814505, lsn: 1/9A4EC888, prev 1/9A4EB9D0, desc: NEXTOID 3311461
> >
> > As far as I can see, something like the above, or an oid wraparound, is
> > pretty much deadly for toast.
> >
> > Is anybody ready with a good defense for SatisfiesToast not doing any
> > actual liveliness checks?
>
> I assume that this was installed as a performance optimization, and I
> don't really see why it shouldn't be safe or be able to be made safe. I
> assume that the wraparound case was deemed safe because, at the time,
> the idea of 4 billion OIDs getting used with old transactions still
> active seemed inconceivable.

It's not super likely, yea. But you don't really need to "use" 4 billion oids to get a
wraparound. Once you have a significant number of values in various toast tables, the
oid counter progresses really rather fast, without many writes. That's because the oid
counter is global, but each individual toast write (and other things) performs checks
via GetNewOidWithIndex().

I'm not sure why you think it's safe? Consider the following scenario:

    BEGIN;
    -- nextoid: 1
    INSERT toastval = chunk_id = 1;
    ROLLBACK;
    ...
    -- oid counter wraps around
    -- nextoid: 1
    INSERT toastval = chunk_id = 1;
    -- crash, losing all pending hint bits
    SELECT toastval;

The last SELECT might find either of the toasted data chunks, depending on heap
ordering. As they're not hinted as invalid due to the crash, HeapTupleSatisfiesToast()
will return both as visible.

To make that safe we'd at least have to make the hint bit writes done by the scan in
GetNewOidWithIndex() durable, and likely also disable the killtuples optimization, to
avoid a plain SELECT of the toast table making some tuples unreachable but not durably
hinted. That seems fairly fragile. I have a significant amount of doubt that toast reads
are bottlenecked by visibility routines.

> It seems to me that the real question
> here is how you're getting two calls to XLogPutNextOid() with the same
> value of ShmemVariableCache->nextOid, and the answer, as it seems to
> me, must be that LWLocks are broken.

There likely were a bunch of crashes in between; Jeff's test suite triggers them at a
high rate. That seems a lot more likely than an lwlock bug that only materializes in the
oid counter. Investigating.

Greetings,

Andres Freund
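For context, a condensed paraphrase of the uniqueness check being referenced (based on
GetNewOidWithIndex() in src/backend/catalog/catalog.c, trimmed; treat it as
approximate): the collision probe uses a dirty snapshot, so a chunk left behind by an
aborted transaction does not count as a collision, and only non-durable hint bits record
that it was examined.

    /* Simplified sketch of GetNewOidWithIndex(); not the verbatim source. */
    Oid
    GetNewOidWithIndex(Relation relation, Oid indexId, AttrNumber oidcolumn)
    {
        Oid          newOid;
        SnapshotData SnapshotDirty;
        SysScanDesc  scan;
        ScanKeyData  key;
        bool         collides;

        InitDirtySnapshot(SnapshotDirty);

        do
        {
            CHECK_FOR_INTERRUPTS();
            newOid = GetNewObjectId();

            ScanKeyInit(&key, oidcolumn, BTEqualStrategyNumber, F_OIDEQ,
                        ObjectIdGetDatum(newOid));

            /*
             * Aborted (dead) rows are not returned by the dirty snapshot,
             * so their OIDs can be handed out again.
             */
            scan = systable_beginscan(relation, indexId, true,
                                      &SnapshotDirty, 1, &key);
            collides = HeapTupleIsValid(systable_getnext(scan));
            systable_endscan(scan);
        } while (collides);

        return newOid;
    }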
Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)
On 2016-05-10 12:28:57 +0200, Simon Riggs wrote:
> On 10 May 2016 at 09:05, Andres Freund <andres@anarazel.de> wrote:
> > Is anybody ready with a good defense for SatisfiesToast not doing any
> > actual liveliness checks?
>
> I provided a patch earlier that rechecks the OID fetched from a toast chunk
> matches the OID requested.

They match in this case, so that's not likely to help with the issue at hand?

- Andres
Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)
On 2016-05-10 10:09:07 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Is anybody ready with a good defense for SatisfiesToast not doing any
> > actual liveliness checks?
>
> As long as we do not update toast values after creation, there is no
> need; the liveness check on the parent tuple is what's important.
> Adding a liveness check on the toast item would merely create new
> failure modes with no corresponding benefit.

You mean besides not having a corrupted database, like in this case?

> Imagine deciding that field 3 of a regular tuple was somehow dead even
> though the rest of the tuple is live --- how can that be good?

How would that ever happen in normal operation? And why are we checking hint bits in
HeapTupleSatisfiesToast() in that case?

    if (!HeapTupleHeaderXminCommitted(tuple))
    {
        if (HeapTupleHeaderXminInvalid(tuple))
            return false;

The only way I can see the scenario you've described happening is if there's a bug
somewhere else.

> I concur with Robert that what this looks like is failure to ensure
> that toast OIDs are unique, which is an entirely different problem.

Well, I did describe how this can come about due to a wraparound, and how
GetNewOidWithIndex isn't sufficient to prevent the problem, as it only hints (and hints
are by definition not durable)...

Greetings,

Andres Freund
On Tue, May 10, 2016 at 12:19 PM, Andres Freund <andres@anarazel.de> wrote:
>> I assume that this was installed as a performance optimization, and I
>> don't really see why it shouldn't be safe or be able to be made safe. I
>> assume that the wraparound case was deemed safe because, at the time,
>> the idea of 4 billion OIDs getting used with old transactions still
>> active seemed inconceivable.
>
> It's not super likely, yea. But you don't really need to "use" 4 billion
> oids to get a wraparound. Once you have a significant number of values
> in various toast tables, the oid counter progresses really rather fast,
> without many writes. That's because the oid counter is global, but each
> individual toast write (and other things) performs checks via
> GetNewOidWithIndex().

Understood.

> I'm not sure why you think it's safe? Consider the following scenario:
>
>     BEGIN;
>     -- nextoid: 1
>     INSERT toastval = chunk_id = 1;
>     ROLLBACK;
>     ...
>     -- oid counter wraps around
>     -- nextoid: 1
>     INSERT toastval = chunk_id = 1;

Uh oh. That's really bad. As long as we vacuum the table every 4 billion OID counter
uses, and no long-running transactions prevent us from doing cleanup, there will be no
issue of this sort, but there's no guarantee of those things being true on modern
hardware. And I doubt we want to have a relmindeadoid to go with relfrozenxid and
relminmxid. Our last round of convincing VACUUM to take into account one more kind of
wraparound prevention wasn't loads of fun. Not to mention the fact that this could wrap
FAR faster.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, May 10, 2016 at 12:19 PM, Andres Freund <andres@anarazel.de> wrote:
>> It's not super likely, yea. But you don't really need to "use" 4 billion
>> oids to get a wraparound. Once you have a significant number of values
>> in various toast tables, the oid counter progresses really rather fast,
>> without many writes. That's because the oid counter is global, but each
>> individual toast write (and other things), perform checks via
>> GetNewOidWithIndex().

> Understood.

Sooner or later we are going to need to go to 8-byte TOAST object
identifiers. Maybe we should think about doing that sooner not later
rather than trying to invent some anti-wraparound solution here.

In principle, you could support existing TOAST tables and pointers
containing 4-byte IDs in parallel with the new ones. Not sure how
pg_upgrade would handle it exactly though.

			regards, tom lane
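[To make the 8-byte idea a bit more concrete, here is the existing on-disk TOAST pointer next to a purely hypothetical widened variant; the 64-bit struct and its field name are invented for illustration and exist nowhere in the tree:]

/* Existing on-disk TOAST pointer (postgres.h): a 4-byte value identifier. */
typedef struct varatt_external
{
	int32		va_rawsize;		/* original data size (includes header) */
	int32		va_extsize;		/* external saved size (doesn't) */
	Oid			va_valueid;		/* unique ID of value within TOAST table */
	Oid			va_toastrelid;	/* RelID of TOAST table containing it */
} varatt_external;

/*
 * Hypothetical 8-byte variant, sketch only.  A new vartag would be needed
 * so that old 4-byte pointers written before an upgrade could still be
 * decoded alongside new ones - which is essentially the pg_upgrade
 * question raised above.
 */
typedef struct varatt_external64
{
	int32		va_rawsize;
	int32		va_extsize;
	uint64		va_valueid64;	/* invented name: 64-bit value identifier */
	Oid			va_toastrelid;
} varatt_external64;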
Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)
From: Alvaro Herrera
Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Tue, May 10, 2016 at 12:19 PM, Andres Freund <andres@anarazel.de> wrote:
> >> It's not super likely, yea. But you don't really need to "use" 4 billion
> >> oids to get a wraparound. Once you have a significant number of values
> >> in various toast tables, the oid counter progresses really rather fast,
> >> without many writes. That's because the oid counter is global, but each
> >> individual toast write (and other things), perform checks via
> >> GetNewOidWithIndex().
>
> > Understood.
>
> Sooner or later we are going to need to go to 8-byte TOAST object
> identifiers. Maybe we should think about doing that sooner not later
> rather than trying to invent some anti-wraparound solution here.

Umm, it seems to me like we need this fixed in supported branches, not
just 9.7, so I don't think 8-byte toast IDs are a reasonable solution at
this point.

> In principle, you could support existing TOAST tables and pointers
> containing 4-byte IDs in parallel with the new ones. Not sure how
> pg_upgrade would handle it exactly though.

I suppose the real problem is that there's no way to have a mix of 4-
and 8-byte identifiers in the same toast table. I suppose we could have
two toast tables for each regular table, but that sounds pretty painful.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 10 May 2016 at 17:20, Andres Freund <andres@anarazel.de> wrote:
> On 2016-05-10 12:28:57 +0200, Simon Riggs wrote:
> > On 10 May 2016 at 09:05, Andres Freund <andres@anarazel.de> wrote:
> > > Is anybody ready with a good defense for SatisfiesToast not doing any
> > > actual liveliness checks?
> >
> > I provided a patch earlier that rechecks the OID fetched from a toast chunk
> > matches the OID requested.
>
> They match in this case, so that's not likely to help with the issue at
> hand?
I can see, but I was answering your more general question above with what I had available.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Sooner or later we are going to need to go to 8-byte TOAST object
>> identifiers. Maybe we should think about doing that sooner not later
>> rather than trying to invent some anti-wraparound solution here.

> Umm, it seems to me like we need this fixed in supported branches, not
> just 9.7, so I don't think 8-byte toast IDs are a reasonable solution at
> this point.

I think the problem is pretty hypothetical until you get to consuming a
substantial part of the OID space within any one toast table, at which
point you're going to need 8-byte toast OIDs. Improving that situation
seems like something we can define as a future feature.

(Note that I do not believe for a moment that Andres has actually
exhibited such a problem in his test case. He'd have needed several
TB worth of TOAST space to get to that point.)

>> In principle, you could support existing TOAST tables and pointers
>> containing 4-byte IDs in parallel with the new ones. Not sure how
>> pg_upgrade would handle it exactly though.

> I suppose the real problem is that there's no way to have a mix of 4-
> and 8-byte identifiers in the same toast table. I suppose we could have
> two toast tables for each regular table, but that sounds pretty painful.

Hmm. It's not impossible by any means, since you could tell by looking
at a toast OID which table it was in. And that way might provide a less
painful upgrade process for an existing table that grew to the point of
needing bigger OIDs.

			regards, tom lane
On Tue, May 10, 2016 at 9:19 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-05-10 08:09:02 -0400, Robert Haas wrote:
>> On Tue, May 10, 2016 at 3:05 AM, Andres Freund <andres@anarazel.de> wrote:
>> > The easy way to trigger this problem would be to have an oid wraparound
>> > - but the WAL shows that that's not the case here. I've not figured
>> > that one out entirely (and won't tonight). But I do see WAL records
>> > like:
>> > rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID 4302693
>> > rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID 4302693

Were there any CHECKPOINT_SHUTDOWN records, or any other NEXTOID
records, between those two records you show?

My current test harness updates the scalar count field on every
iteration, but changes the (probably toasted) text_array field with a
probability of only 1% each time. Perhaps making that more likely (by
changing line 186 of count.pl) would make it easier to trigger the
bug. I'll try that in my next iteration of tests.

Cheers,

Jeff
Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)
From: Andres Freund
On 2016-05-10 09:19:16 -0700, Andres Freund wrote:
> On 2016-05-10 08:09:02 -0400, Robert Haas wrote:
> > On Tue, May 10, 2016 at 3:05 AM, Andres Freund <andres@anarazel.de> wrote:
> > > The easy way to trigger this problem would be to have an oid wraparound
> > > - but the WAL shows that that's not the case here. I've not figured
> > > that one out entirely (and won't tonight). But I do see WAL records
> > > like:
> > > rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID 4302693
> > > rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID 4302693
> > > i.e. two NEXTOID records allocating the same range, which obviously
> > > doesn't seem right. There's also every now and then close by ranges:
> > > rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 1/9A404DB8, prev 1/9A404270, desc: NEXTOID 3311455
> > > rmgr: XLOG len (rec/tot): 4/ 30, tx: 7814505, lsn: 1/9A4EC888, prev 1/9A4EB9D0, desc: NEXTOID 3311461
>
> > It seems to me that the real question
> > here is how you're getting two calls to XLogPutNextOid() with the same
> > value of ShmemVariableCache->nextOid, and the answer, as it seems to
> > me, must be that LWLocks are broken.
>
> There likely were a bunch of crashes in between, Jeff's test suite
> triggers them at a high rate. It seems a lot more likely than that an
> lwlock bug only materializes in the oid counter. Investigating.

void
CreateCheckPoint(int flags)
{
	...
	/*
	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
	 * issued at a different time.
	 */
	if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
		shutdown = true;
	else
		shutdown = false;
	...
	LWLockAcquire(OidGenLock, LW_SHARED);
	checkPoint.nextOid = ShmemVariableCache->nextOid;
	if (!shutdown)
		checkPoint.nextOid += ShmemVariableCache->oidCount;
	LWLockRelease(OidGenLock);
	...
	recptr = XLogInsert(RM_XLOG_ID,
						shutdown ? XLOG_CHECKPOINT_SHUTDOWN :
						XLOG_CHECKPOINT_ONLINE);
	...
}

I think that's to blame here. Looking at the relevant WAL record shows:

pg_xlogdump -p /data/freund/jj -s 2/12004018 -e 2/1327EA28 | grep -E 'CHECKPOINT|NEXTOID'
rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID 4302693
rmgr: XLOG len (rec/tot): 80/ 106, tx: 0, lsn: 2/12023C38, prev 2/12023C00, desc: CHECKPOINT_ONLINE redo 2/12000120; /* ... */ oid 4294501; /* ... */ online
rmgr: XLOG len (rec/tot): 80/ 106, tx: 0, lsn: 2/1327A798, prev 2/1327A768, desc: CHECKPOINT_SHUTDOWN redo 2/1327A798; /* ... */ oid 4294501; /* ... */ shutdown
rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID 4302693

(note that end-of-recovery checkpoints are logged as shutdown
checkpoints, pretty annoying imo)

So I think the issue is that the 2/12023C38 checkpoint *starts* before
the first NEXTOID, and thus gets an earlier nextoid. The second -
shutdown/end-of-recovery - checkpoint then hits the above !shutdown and
doesn't add oidCount. Thus after the crash we continue with a repeated
NEXTOID. There's this remark in xlog_redo():

	/*
	 * We used to try to take the maximum of ShmemVariableCache->nextOid
	 * and the recorded nextOid, but that fails if the OID counter wraps
	 * around.  Since no OID allocation should be happening during replay
	 * anyway, better to just believe the record exactly.  We still take
	 * OidGenLock while setting the variable, just in case.
	 */

I think that was perhaps not the best fix :(

I guess what we should do is to only use checkPoint.nextOid when
starting up from a checkpoint, and entirely rely on NEXTOID otherwise?

Regards,

Andres
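[A quick sanity check on the numbers above, assuming the OID prefetch amount is still the 8192 from varsup.c: the stale counter recorded in both checkpoint records, plus one prefetch range, lands exactly on the duplicated NEXTOID value.]

#define VAR_OID_PREFETCH 8192	/* value from src/backend/access/transam/varsup.c */

/*
 * oid recorded by both checkpoint records:   4294501
 * + one NEXTOID prefetch range:                 8192
 * = value of the duplicated NEXTOID record:  4302693
 *
 * i.e. after replay winds the counter back to 4294501, the very next
 * allocation re-issues the same range that was already handed out
 * before the crash.
 */
_Static_assert(4294501 + VAR_OID_PREFETCH == 4302693,
			   "stale checkpoint oid + prefetch reproduces the duplicated NEXTOID");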
Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)
From: Andres Freund
On 2016-05-10 15:20:39 -0300, Alvaro Herrera wrote:
> Tom Lane wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> > > On Tue, May 10, 2016 at 12:19 PM, Andres Freund <andres@anarazel.de> wrote:
> > >> It's not super likely, yea. But you don't really need to "use" 4 billion
> > >> oids to get a wraparound. Once you have a significant number of values
> > >> in various toast tables, the oid counter progresses really rather fast,
> > >> without many writes. That's because the oid counter is global, but each
> > >> individual toast write (and other things), perform checks via
> > >> GetNewOidWithIndex().
> >
> > > Understood.
> >
> > Sooner or later we are going to need to go to 8-byte TOAST object
> > identifiers. Maybe we should think about doing that sooner not later
> > rather than trying to invent some anti-wraparound solution here.
>
> Umm, it seems to me like we need this fixed in supported branches, not
> just 9.7, so I don't think 8-byte toast IDs are a reasonable solution at
> this point.

Agreed. To me it seems we need to make HeapTupleSatisfiesToast() safe,
and other improvements are tangential.

Andres
Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)
From: Andres Freund
On 2016-05-10 16:14:50 -0400, Tom Lane wrote:
> I think the problem is pretty hypothetical until you get to consuming a
> substantial part of the OID space within any one toast table, at which
> point you're going to need 8-byte toast OIDs. Improving that situation
> seems like something we can define as a future feature.

Doesn't really have to be within a single toast table. As the oid
counter is global, it just needs to advance quickly enough globally.

> (Note that I do not believe for a moment that Andres has actually
> exhibited such a problem in his test case. He'd have needed several
> TB worth of TOAST space to get to that point.)

That's good, because I explicitly said that there was no wraparound
involved in the test case.

> >> In principle, you could support existing TOAST tables and pointers
> >> containing 4-byte IDs in parallel with the new ones. Not sure how
> >> pg_upgrade would handle it exactly though.
>
> > I suppose the real problem is that there's no way to have a mix of 4-
> > and 8-byte identifiers in the same toast table. I suppose we could have
> > two toast tables for each regular table, but that sounds pretty painful.
>
> Hmm. It's not impossible by any means, since you could tell by looking
> at a toast OID which table it was in. And that way might provide a less
> painful upgrade process for an existing table that grew to the point of
> needing bigger OIDs.

The catalog representation (as in pg_class.reltoastrelid) isn't entirely
clear to me. One way would be to invert pg_class.reltoastrelid's
meaning and have the toast table point to the table it stores values
for. That'd also open the potential of having one toast table per column
and such.

Greetings,

Andres Freund
Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)
From: Andres Freund
On 2016-05-10 13:17:52 -0700, Jeff Janes wrote:
> On Tue, May 10, 2016 at 9:19 AM, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-05-10 08:09:02 -0400, Robert Haas wrote:
> >> On Tue, May 10, 2016 at 3:05 AM, Andres Freund <andres@anarazel.de> wrote:
> >> > The easy way to trigger this problem would be to have an oid wraparound
> >> > - but the WAL shows that that's not the case here. I've not figured
> >> > that one out entirely (and won't tonight). But I do see WAL records
> >> > like:
> >> > rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID 4302693
> >> > rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID 4302693
>
> Were there any CHECKPOINT_SHUTDOWN records, or any other NEXTOID
> records, between those two records you show?

Yes, check
http://www.postgresql.org/message-id/20160510210013.2akn4iee7gl4ycen@alap3.anarazel.de
I think the explanation there about how the bug is occurring makes sense.

> My current test harness updates the scalar count field on every
> iteration, but changes the (probably toasted) text_array field with a
> probability of only 1% each time. Perhaps making that more likely (by
> changing line 186 of count.pl) would make it easier to trigger the
> bug. I'll try that in my next iteration of tests.

So my current theory about why the whole thing is kinda hard to
reproduce is that "luck" determines how aggressively the toast table is
vacuumed, and how often it actually succeeds in being vacuumed. You also
need a good bit of bad luck for the hint bits set by GetNewOidWithIndex()
to not survive, given that shared_buffers is pretty small *and* checksums
are enabled. I guess testing with a bigger shared memory and without
checksums will make it easier to hit the bug.

Regards,

Andres
Andres Freund <andres@anarazel.de> writes:
> On 2016-05-10 16:14:50 -0400, Tom Lane wrote:
>> I think the problem is pretty hypothetical until you get to consuming a
>> substantial part of the OID space within any one toast table, at which
>> point you're going to need 8-byte toast OIDs. Improving that situation
>> seems like something we can define as a future feature.

> Doesn't really have to be within a single toast table. As the oid
> counter is global, it just needs to advance quickly enough globally.

I don't think that alters the conclusion. To get rapid advancement
through the whole 4G OID space, you'd need densely packed existing OIDs
in all of the toast tables being inserted into. And the more toast
tables in play, the less likely any one of them is densely packed,
because it's likely some OIDs were assigned into other toast tables.

Now having said that, you don't actually need *rapid* advancement of
the OID counter to have a potential problem. Imagine that some
transaction inserts a TOAST value and later fails, so that you have a
dead-but-unhinted toast tuple sitting there. If no VACUUM comes along
to clean it up (quite possible if there's low activity on the parent
table), then after 4G uses of OIDs, you could be unlucky enough that
the next attempted insertion into that TOAST table chooses the same
OID. Now, that choosing would be done by GetNewOidWithIndex which
would have to have seen the dead toast tuple; but if its hinting of
the tuple got lost somehow then you could have two tuples with the
same OID and neither hinted as dead. Then subsequent TOAST fetches
have a problem that they might take the wrong one.

Having said that, I still say that changing HeapTupleSatisfiesToast
is the wrong thing. It can't go deciding not to return toast values
because they're committed dead --- the parent tuple could easily be
committed dead as well, and yet still be visible to our query's
snapshot. Nor is it attractive to introduce more hint-bit-setting
requirements for toast tuples; that just increases the amount of
hint-bit-setting disk traffic that must occur subsequently to any
update involving toast values.

Maybe the right fix here is for toast OID assignment to use
GetNewOidWithIndex with SnapshotToast, or some special-purpose
snapshot that does the right thing, so that it would see such a
dead toast row as being a conflict. (If you got enough such dead
rows, this might make it hard to find a free OID; but hopefully
vacuum would come along and fix things before it got to that.)

			regards, tom lane
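[For concreteness, a very rough sketch of where that would land: toast_save_datum() currently picks chunk_id via GetNewOidWithIndex(), which probes the toast index with its own internal snapshot. The variant taking an explicit snapshot argument below is invented purely to illustrate the suggestion and does not exist.]

/*
 * Hypothetical sketch only -- GetNewOidWithIndexAndSnapshot() is made up.
 * The idea is that the duplicate-OID probe inside the allocation loop
 * would scan the toast index with a snapshot that also sees
 * dead-but-not-yet-vacuumed chunk rows (SnapshotAny, or a purpose-built
 * "SnapshotToast"-style snapshot), so a lingering aborted chunk with the
 * same chunk_id counts as a collision instead of being silently reused.
 */
toast_pointer.va_valueid =
	GetNewOidWithIndexAndSnapshot(toastrel,
								  RelationGetRelid(toastidx),
								  (AttrNumber) 1,	/* chunk_id column */
								  SnapshotAny);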
Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)
From: Andres Freund
On 2016-05-10 18:29:59 -0400, Tom Lane wrote:
> Now having said that, you don't actually need *rapid* advancement
> of the OID counter to have a potential problem. Imagine that some
> transaction inserts a TOAST value and later fails, so that you have
> a dead-but-unhinted toast tuple sitting there. If no VACUUM comes
> along to clean it up (quite possible if there's low activity on
> the parent table)

Or if there's primarily inserts...

> Having said that, I still say that changing HeapTupleSatisfiesToast
> is the wrong thing. It can't go deciding not to return toast values
> because they're committed dead --- the parent tuple could easily be
> committed dead as well, and yet still be visible to our query's
> snapshot.

Hm. Shouldn't a mvcc snapshot be able to differentiate between those
cases? When are we looking up toast tuple that's *not* visible to the
current snapshot? I guess there could be some ugliness around RETURNING
for DELETE.

I guess we'd have to use logic akin to SatisfiesVacuum, treating
HEAPTUPLE_*_IN_PROGRESS/RECENTLY_DEAD as alive.

> Nor is it attractive to introduce more hint-bit-setting
> requirements for toast tuples; that just increases the amount of
> hint-bit-setting disk traffic that must occur subsequently to
> any update involving toast values.

Yea, that's the big argument against going this way.

> Maybe the right fix here is for toast OID assignment to use
> GetNewOidWithIndex with SnapshotToast, or some special-purpose
> snapshot that does the right thing, so that it would see such a
> dead toast row as being a conflict. (If you got enough such
> dead rows, this might make it hard to find a free OID; but hopefully
> vacuum would come along and fix things before it got to that.)

Seems like we'd have to either emit cleanup records, or treat even
tuples that are hinted as dead as a conflict (lest they get lost in a
crash or failover).

Greetings,

Andres Freund
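[A sketch of the "treat IN_PROGRESS/RECENTLY_DEAD as alive" rule described above - not a proposed patch; the helper name is made up, and HeapTupleSatisfiesVacuum() requires the buffer's content lock in real code:]

/*
 * Sketch: classify a toast chunk by reusing HeapTupleSatisfiesVacuum()
 * rather than teaching SatisfiesToast about MVCC snapshots.  Only chunks
 * that are dead to everyone are rejected; LIVE, RECENTLY_DEAD and the
 * *_IN_PROGRESS states all count as alive, so a parent tuple that is
 * committed dead but still visible to some snapshot can still fetch its
 * chunks.
 */
static bool
toast_chunk_considered_live(HeapTuple tup, TransactionId OldestXmin, Buffer buf)
{
	switch (HeapTupleSatisfiesVacuum(tup, OldestXmin, buf))
	{
		case HEAPTUPLE_DEAD:
			return false;
		case HEAPTUPLE_LIVE:
		case HEAPTUPLE_RECENTLY_DEAD:
		case HEAPTUPLE_INSERT_IN_PROGRESS:
		case HEAPTUPLE_DELETE_IN_PROGRESS:
		default:
			return true;
	}
}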
On Tue, May 10, 2016 at 2:00 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-05-10 09:19:16 -0700, Andres Freund wrote:
>> On 2016-05-10 08:09:02 -0400, Robert Haas wrote:
>> > On Tue, May 10, 2016 at 3:05 AM, Andres Freund <andres@anarazel.de> wrote:
>> > > The easy way to trigger this problem would be to have an oid wraparound
>> > > - but the WAL shows that that's not the case here. I've not figured
>> > > that one out entirely (and won't tonight). But I do see WAL records
>> > > like:
>> > > rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID 4302693
>> > > rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID 4302693
>> > > i.e. two NEXTOID records allocating the same range, which obviously
>> > > doesn't seem right. There's also every now and then close by ranges:
>> > > rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 1/9A404DB8, prev 1/9A404270, desc: NEXTOID 3311455
>> > > rmgr: XLOG len (rec/tot): 4/ 30, tx: 7814505, lsn: 1/9A4EC888, prev 1/9A4EB9D0, desc: NEXTOID 3311461
>
>> > It seems to me that the real question
>> > here is how you're getting two calls to XLogPutNextOid() with the same
>> > value of ShmemVariableCache->nextOid, and the answer, as it seems to
>> > me, must be that LWLocks are broken.
>>
>> There likely were a bunch of crashes in between, Jeff's test suite
>> triggers them at a high rate. It seems a lot more likely than that an
>> lwlock bug only materializes in the oid counter. Investigating.
>
> void
> CreateCheckPoint(int flags)
> {
> ...
>     /*
>      * An end-of-recovery checkpoint is really a shutdown checkpoint, just
>      * issued at a different time.
>      */
>     if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
>         shutdown = true;
>     else
>         shutdown = false;
> ...
>     LWLockAcquire(OidGenLock, LW_SHARED);
>     checkPoint.nextOid = ShmemVariableCache->nextOid;
>     if (!shutdown)
>         checkPoint.nextOid += ShmemVariableCache->oidCount;
>     LWLockRelease(OidGenLock);
> ...
>     recptr = XLogInsert(RM_XLOG_ID,
>                         shutdown ? XLOG_CHECKPOINT_SHUTDOWN :
>                         XLOG_CHECKPOINT_ONLINE);
> ...
> }
>
> I think that's to blame here. Looking at the relevant WAL record shows:
>
> pg_xlogdump -p /data/freund/jj -s 2/12004018 -e 2/1327EA28 | grep -E 'CHECKPOINT|NEXTOID'
> rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID 4302693
> rmgr: XLOG len (rec/tot): 80/ 106, tx: 0, lsn: 2/12023C38, prev 2/12023C00, desc: CHECKPOINT_ONLINE redo 2/12000120; /* ... */ oid 4294501; /* ... */ online

By my understanding, this is the point at which the crash occurred.

> rmgr: XLOG len (rec/tot): 80/ 106, tx: 0, lsn: 2/1327A798, prev 2/1327A768, desc: CHECKPOINT_SHUTDOWN redo 2/1327A798; /* ... */ oid 4294501; /* ... */ shutdown
> rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID 4302693
>
> (note that end-of-recovery checkpoints are logged as shutdown
> checkpoints, pretty annoying imo)
>
> So I think the issue is that the 2/12023C38 checkpoint *starts* before
> the first NEXTOID, and thus gets an earlier nextoid.

But isn't CreateCheckPoint called at the end of the checkpoint, not
the start of it? I don't understand how it could be out-of-date at
that point. But obviously it is.

Cheers,

Jeff
Andres Freund <andres@anarazel.de> writes:
> On 2016-05-10 18:29:59 -0400, Tom Lane wrote:
>> Having said that, I still say that changing HeapTupleSatisfiesToast
>> is the wrong thing. It can't go deciding not to return toast values
>> because they're committed dead --- the parent tuple could easily be
>> committed dead as well, and yet still be visible to our query's
>> snapshot.

> Hm. Shouldn't a mvcc snapshot be able to differentiate between those
> cases?

HeapTupleSatisfiesToast doesn't have one. And changing things so that
toast tuples are checked using MVCC rules is the wrong thing anyway,
because it would require adding hint-bit update traffic for toast tables.

> When are we looking up toast tuple that's *not* visible to the
> current snapshot?

Once again, it's the parent tuple where we should be doing the
visibility check; noplace else.

			regards, tom lane
Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)
From: Andres Freund
On 2016-05-10 15:53:38 -0700, Jeff Janes wrote:
> On Tue, May 10, 2016 at 2:00 PM, Andres Freund <andres@anarazel.de> wrote:
> > I think that's to blame here. Looking at the relevant WAL record shows:
> >
> > pg_xlogdump -p /data/freund/jj -s 2/12004018 -e 2/1327EA28 | grep -E 'CHECKPOINT|NEXTOID'
> > rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID 4302693
> > rmgr: XLOG len (rec/tot): 80/ 106, tx: 0, lsn: 2/12023C38, prev 2/12023C00, desc: CHECKPOINT_ONLINE redo 2/12000120; /* ... */ oid 4294501; /* ... */ online
>
> By my understanding, this is the point at which the crash occurred.

Right.

> > rmgr: XLOG len (rec/tot): 80/ 106, tx: 0, lsn: 2/1327A798, prev 2/1327A768, desc: CHECKPOINT_SHUTDOWN redo 2/1327A798; /* ... */ oid 4294501; /* ... */ shutdown
> > rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID 4302693
> >
> > (note that end-of-recovery checkpoints are logged as shutdown
> > checkpoints, pretty annoying imo)
> >
> > So I think the issue is that the 2/12023C38 checkpoint *starts* before
> > the first NEXTOID, and thus gets an earlier nextoid.
>
> But isn't CreateCheckPoint called at the end of the checkpoint, not
> the start of it?

No, CreateCheckPoint() does it all.

CreateCheckPoint(int flags)
{
	...
	/* 1) determine redo pointer */
	WALInsertLockAcquireExclusive();
	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
	prevPtr = XLogBytePosToRecPtr(Insert->PrevBytePos);
	WALInsertLockRelease();
	...
	/* 2) determine oid */
	LWLockAcquire(OidGenLock, LW_SHARED);
	checkPoint.nextOid = ShmemVariableCache->nextOid;
	if (!shutdown)
		checkPoint.nextOid += ShmemVariableCache->oidCount;
	LWLockRelease(OidGenLock);
	...
	/* 3) actually checkpoint shared_buffers et al. */
	CheckPointGuts(checkPoint.redo, flags);
	...
	/* 4) log the checkpoint */
	recptr = XLogInsert(RM_XLOG_ID,
						shutdown ? XLOG_CHECKPOINT_SHUTDOWN :
						XLOG_CHECKPOINT_ONLINE);
	...
}

> I don't understand how it could be out-of-date at that point. But
> obviously it is.

A checkpoint logically "starts" at 1) in the above abbreviated
CreateCheckPoint(), that's where recovery starts when starting up from
that checkpoint. But in between 1) and 4) all other backends can continue
to insert WAL, and it'll be replayed *before* the checkpoint's record
itself. That means that if some backend generates a NEXTOID record
between 2) and 4) (with larger checkpoints we're looking at minutes to
an hour of time), its effects will temporarily take effect (as in
ShmemVariableCache->nextOid is updated), but XLOG_CHECKPOINT_ONLINE's
replay will overwrite it unconditionally:

void
xlog_redo(XLogReaderState *record)
{
	else if (info == XLOG_CHECKPOINT_ONLINE)
	{
		...
		/* ... but still treat OID counter as exact */
		LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
		ShmemVariableCache->nextOid = checkPoint.nextOid;
		ShmemVariableCache->oidCount = 0;
		LWLockRelease(OidGenLock);

Makes sense?

Regards,

Andres
On Tue, May 10, 2016 at 4:02 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-05-10 15:53:38 -0700, Jeff Janes wrote:
>>
>> But isn't CreateCheckPoint called at the end of the checkpoint, not
>> the start of it?
>
> No, CreateCheckPoint() does it all.
>
> CreateCheckPoint(int flags)
> {
> ...
>     /* 1) determine redo pointer */
>     WALInsertLockAcquireExclusive();
>     curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
>     prevPtr = XLogBytePosToRecPtr(Insert->PrevBytePos);
>     WALInsertLockRelease();
> ...
>     /* 2) determine oid */
>     LWLockAcquire(OidGenLock, LW_SHARED);
>     checkPoint.nextOid = ShmemVariableCache->nextOid;
>     if (!shutdown)
>         checkPoint.nextOid += ShmemVariableCache->oidCount;
>     LWLockRelease(OidGenLock);
> ...
>     /* 3) actually checkpoint shared_buffers et al. */
>     CheckPointGuts(checkPoint.redo, flags);
> ...
>     /* 4) log the checkpoint */
>     recptr = XLogInsert(RM_XLOG_ID,
>                         shutdown ? XLOG_CHECKPOINT_SHUTDOWN :
>                         XLOG_CHECKPOINT_ONLINE);
> ...
> }

OK, got it. I don't know how I missed the bigger picture of that
function in the first place.

But, another perhaps stupid question, why do we care what the value of
nextOid was at the start of the last successfully completed checkpoint?
Wouldn't it make more sense to populate that part of the record at the
end, not the start, of the checkpoint?

>> I don't understand how it could be out-of-date at that point. But
>> obviously it is.
>
> A checkpoint logically "starts" at 1) in the above abbreviated
> CreateCheckPoint(), that's where recovery starts when starting up from
> that checkpoint. But in between 1) and 4) all other backends can continue
> to insert WAL, and it'll be replayed *before* the checkpoint's record
> itself. That means that if some backend generates a NEXTOID record
> between 2) and 4) (with larger checkpoints we're looking at minutes to
> an hour of time), its effects will temporarily take effect (as in
> ShmemVariableCache->nextOid is updated), but XLOG_CHECKPOINT_ONLINE's
> replay will overwrite it unconditionally:
>
> void
> xlog_redo(XLogReaderState *record)
> {
>     else if (info == XLOG_CHECKPOINT_ONLINE)
>     {
>         ...
>         /* ... but still treat OID counter as exact */
>         LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
>         ShmemVariableCache->nextOid = checkPoint.nextOid;
>         ShmemVariableCache->oidCount = 0;
>         LWLockRelease(OidGenLock);
>
> Makes sense?

So it seems like we should unconditionally **not** do that when
replaying XLOG_CHECKPOINT_ONLINE, right? If this is the checkpoint
record at which we started the system then we would have done that
"replay" into ShmemVariableCache->nextOid during StartupXLOG(), and
there is no reason to do it again when redo passes through that record
the 2nd time.

But, to round this out, all of this is formally only a hint on where
to start OIDs so as to avoid performance problems as you busy-loop
looking for an unused OID. The real correctness bug is in the
hint-bit problem you discuss elsewhere that turns collisions into
corruption, and this bug just makes that other one reachable?

Cheers,

Jeff
Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)
From: Andres Freund
On 2016-05-10 17:36:06 -0700, Jeff Janes wrote:
> OK, got it. I don't know how I missed the bigger picture of that
> function in the first place.
>
> But, another perhaps stupid question, why do we care what the value of
> nextOid was at the start of the last successfully completed checkpoint?
> Wouldn't it make more sense to populate that part of the record at the
> end, not the start, of the checkpoint?

I don't think that really makes a large difference. Unless we'd hold
OidGenLock across the XLogInsert() - which we don't want to do for
performance / deadlock reasons - we'd have a similar race. I think it's
simply important to recognize (and better document!) that the values in
the checkpoint record are from roughly the time of the REDO pointer and
not from the end of the checkpoint.

> >> I don't understand how it could be out-of-date at that point. But
> >> obviously it is.
> >
> > A checkpoint logically "starts" at 1) in the above abbreviated
> > CreateCheckPoint(), that's where recovery starts when starting up from
> > that checkpoint. But in between 1) and 4) all other backends can continue
> > to insert WAL, and it'll be replayed *before* the checkpoint's record
> > itself. That means that if some backend generates a NEXTOID record
> > between 2) and 4) (with larger checkpoints we're looking at minutes to
> > an hour of time), its effects will temporarily take effect (as in
> > ShmemVariableCache->nextOid is updated), but XLOG_CHECKPOINT_ONLINE's
> > replay will overwrite it unconditionally:
> >
> > void
> > xlog_redo(XLogReaderState *record)
> > {
> >     else if (info == XLOG_CHECKPOINT_ONLINE)
> >     {
> >         ...
> >         /* ... but still treat OID counter as exact */
> >         LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
> >         ShmemVariableCache->nextOid = checkPoint.nextOid;
> >         ShmemVariableCache->oidCount = 0;
> >         LWLockRelease(OidGenLock);
> >
> > Makes sense?
>
> So it seems like we should unconditionally **not** do that when
> replaying XLOG_CHECKPOINT_ONLINE, right?

I think we should actually just not do it during replay *at all*. It
seems entirely sufficient to use ->nextOid in StartupXLOG when it has
found the checkpoint to start up from:

	/* initialize shared memory variables from the checkpoint record */
	ShmemVariableCache->nextXid = checkPoint.nextXid;
	ShmemVariableCache->nextOid = checkPoint.nextOid;
	ShmemVariableCache->oidCount = 0;

FWIW, I've commented out the relevant sections from xlog_redo and since
then I've not been able to reproduce the issue.

> But, to round this out, all of this is formally only a hint on where
> to start OIDs so as to avoid performance problems as you busy-loop
> looking for an unused OID. The real correctness bug is in the
> hint-bit problem you discuss elsewhere that turns collisions into
> corruption, and this bug just makes that other one reachable?

That's my opinion at least.

Greetings,

Andres Freund
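[For concreteness, the experiment described above amounts to something like this in xlog_redo()'s XLOG_CHECKPOINT_ONLINE branch - a sketch of the idea only, not the committed fix; per "not do it during replay *at all*", the shutdown-checkpoint branch would presumably get the same treatment:]

else if (info == XLOG_CHECKPOINT_ONLINE)
{
	CheckPoint	checkPoint;

	memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));

#if 0
	/*
	 * The overwrite disabled in the experiment: stop letting an online
	 * checkpoint record wind ShmemVariableCache->nextOid backwards during
	 * replay.  StartupXLOG() still seeds nextOid from the checkpoint we
	 * actually start from, and NEXTOID records keep it current from there.
	 */
	LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
	ShmemVariableCache->nextOid = checkPoint.nextOid;
	ShmemVariableCache->oidCount = 0;
	LWLockRelease(OidGenLock);
#endif

	/* ... remainder of the XLOG_CHECKPOINT_ONLINE redo unchanged ... */
}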
Jeff Janes <jeff.janes@gmail.com> writes:
> But, another perhaps stupid question, why do we care what the value of
> nextOid was at the start of the last successfully completed
> checkpoint?

The intended use of that field is to restore nextOid before replaying
WAL. So it should correspond to the value at checkpoint start. I think
what the code is doing now during replay is probably just brain fade :-(

> Wouldn't it make more sense to populate that part of the
> record at the end, not the start, of the checkpoint?

The idea is that we should initialize the OID counter to something
approximately right before we start replay, so that replay of OID-
counter-advance records behaves sensibly.

> But, to round this out, all of this is formally only a hint on where
> to start OIDs so as to avoid performance problems as you busy-loop
> looking for an unused OID. The real correctness bug is in the
> hint-bit problem you discuss elsewhere that turns collisions into
> corruption, and this bug just makes that other one reachable?

Right, the theory is that being slightly off shouldn't matter. If
that's wrong, we would have race-condition problems with either intended
timing of sampling the OID counter.

			regards, tom lane
Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)
From: Teodor Sigaev
>> Allow Pin/UnpinBuffer to operate in a lockfree manner.
>> I get the errors:
>>
>> ERROR: attempted to delete invisible tuple
>> ERROR: unexpected chunk number 1 (expected 2) for toast value

Just a reminder: you are investigating the "unexpected chunk number"
problem, but, it seems, we have another bug (the first one, "attempted
to delete invisible tuple"). IMHO it's a separate bug, not related to
oid. Unfortunately, I've never seen such an error on my notebook.

--
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                      WWW: http://www.sigaev.ru/
Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)
From: Andres Freund
On May 11, 2016 7:06:05 AM PDT, Teodor Sigaev <teodor@sigaev.ru> wrote:
>>> Allow Pin/UnpinBuffer to operate in a lockfree manner.
>>> I get the errors:
>>>
>>> ERROR: attempted to delete invisible tuple
>>> ERROR: unexpected chunk number 1 (expected 2) for toast value
>
> Just a reminder: you are investigating the "unexpected chunk number"
> problem, but, it seems, we have another bug (the first one, "attempted
> to delete invisible tuple"). IMHO it's a separate bug, not related to
> oid. Unfortunately, I've never seen such an error on my notebook.

Same issue. If the dead tuple is noticed by heap_delete (when it should
have deleted the live version elsewhere) you get the "attempted to delete
invisible tuple" error. The "unexpected chunk number" bit only triggers
with hint bits set differently across chunks.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)
From: Andres Freund
On 2016-05-10 17:58:33 -0700, Andres Freund wrote:
> FWIW, I've commented out the relevant sections from xlog_redo and since
> then I've not been able to reproduce the issue.

A couple days of running later, and it hasn't yet re-triggered. So I
assume that's indeed the fix.
On Mon, May 16, 2016 at 8:28 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-05-10 17:58:33 -0700, Andres Freund wrote:
>> FWIW, I've commented out the relevant sections from xlog_redo and since
>> then I've not been able to reproduce the issue.
>
> A couple days of running later, and it hasn't yet re-triggered. So I
> assume that's indeed the fix.

Yes, I've also run it to wrap-around a few times with that commented
out, so I agree that that is a fix. But I don't think we should apply
it until the larger issue with hint bits is fixed first, as otherwise
we are masking one problem by fixing another. Although it doesn't
really matter to me, because I'm not using HEAD to run the tests
anyway, as I can't get the error to trigger on HEAD.

As a side note, using 48354581a49c30f5 and without commenting out the
part of the replay code we are talking about, I also need
track_io_timing to be on in order to get the error. I don't think
there is any direct causation there, just another fiddly timing issue.

I've tried fixing the hint-bit issue as well, but I'm afraid it is
over my head. I'll be happy to test proposed patches, though.

Cheers,

Jeff
On 5/10/16 4:12 PM, Andres Freund wrote:
> The catalog representation (as in pg_class.reltoastrelid) isn't entirely
> clear to me. One way would be to invert pg_class.reltoastrelid's
> meaning and have the toast table point to the table it stores values
> for. That'd also open the potential of having one toast table per column
> and such.

FWIW, toast-per-column is something I have a use case for.

Can we also consider using a per-toast-table sequence instead of OID?
IIRC the generation mechanics of the two are similar, and that would
greatly reduce the pressure on OID generation.

Tom, were you around when sequences were added? I'm guessing that that
was done in response to OIDs becoming a serious problem in user tables
on larger installs; ISTM this is just the next iteration of them being
a problem. (And I suspect the one after this will be pg_attribute or
maybe pg_depend.)

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461
On Tue, May 10, 2016 at 02:06:19PM -0400, Tom Lane wrote:
> Sooner or later we are going to need to go to 8-byte TOAST object
> identifiers. Maybe we should think about doing that sooner not
> later rather than trying to invent some anti-wraparound solution
> here.

Yay! Is there any lift in separating TOAST OIDs from the rest?

> In principle, you could support existing TOAST tables and pointers
> containing 4-byte IDs in parallel with the new ones.
> Not sure how pg_upgrade would handle it exactly though.

This is yet another reason we should get away from in-place binary
upgrade as a strategy. It's always been fragile, and it's only ever
been justifiable on grounds of hardware economics that no longer
obtain.

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate