Thread: atomic pin/unpin causing errors

atomic pin/unpin causing errors

From
Jeff Janes
Date:
I've bisected the errors I was seeing, discussed in
http://www.postgresql.org/message-id/CAMkU=1xQEhC0Ok4d+tkjFQ1nvUhO37PYRKhJP6Q8oxifMx7OwA@mail.gmail.com

It look like they first appear in:

commit 48354581a49c30f5757c203415aa8412d85b0f70
Author: Andres Freund <andres@anarazel.de>
Date:   Sun Apr 10 20:12:32 2016 -0700
   Allow Pin/UnpinBuffer to operate in a lockfree manner.


I get the errors:

ERROR:  attempted to delete invisible tuple
STATEMENT:  update foo set count=count+1,text_array=$1 where text_array @> $2

And also:

ERROR:  unexpected chunk number 1 (expected 2) for toast value
85223889 in pg_toast_16424
STATEMENT:  update foo set count=count+1 where text_array @> $1

Once these errors start occurring, they happen often.  Usually the
"attempted to delete invisible tuple" happens first.

These errors show up after about 9 hours of run time.  The timing is
predictable enough that I don't think it is a purely stochastic race
condition.  It seems like some counter variable is overflowing.  But
it is not the ShmemVariableCache->nextXid counter, as I previously
speculated.  This test does not advance that fast enough to for it to
wrap around within 9 hours of run time.  But I am at a loss of what
other variable it might be. Since the system goes through a crash and
recovery every few seconds, any backend-local counters or
shared-memory counters would get reset upon recovery.  Right?

I think the invisible tuple referred to might be a tuple in the toast
table, not in the parent table.

I don't see the problem with an cassert-enabled, probably because it
is just too slow to ever reach the point where the problem occurs.

Any suggestions about where or how to look?  I don't know if the
"attempted to delete invisible tuple" is the bug itself, or is just
tripping over corruption left behind by someone else.

(This was all run using Teodor's test-enabling patch
gin_alone_cleanup-4.patch, so as not to change horses in midstream.
Now that a version of that patch has been committed, I will try to
repeat this in HEAD)

Cheers,

Jeff



Re: atomic pin/unpin causing errors

From
Andres Freund
Date:
Hi,

On 2016-04-29 10:38:55 -0700, Jeff Janes wrote:
> I've bisected the errors I was seeing, discussed in
> http://www.postgresql.org/message-id/CAMkU=1xQEhC0Ok4d+tkjFQ1nvUhO37PYRKhJP6Q8oxifMx7OwA@mail.gmail.com
> 
> It look like they first appear in:
> 
> commit 48354581a49c30f5757c203415aa8412d85b0f70
> Author: Andres Freund <andres@anarazel.de>
> Date:   Sun Apr 10 20:12:32 2016 -0700
> 
>     Allow Pin/UnpinBuffer to operate in a lockfree manner.
> 
> 
> I get the errors:
> 
> ERROR:  attempted to delete invisible tuple
> STATEMENT:  update foo set count=count+1,text_array=$1 where text_array @> $2
> 
> And also:
> 
> ERROR:  unexpected chunk number 1 (expected 2) for toast value
> 85223889 in pg_toast_16424
> STATEMENT:  update foo set count=count+1 where text_array @> $1
> 
> Once these errors start occurring, they happen often.  Usually the
> "attempted to delete invisible tuple" happens first.

That kind of seems to implicate clog/vacuuming or something like that
being involved.


> These errors show up after about 9 hours of run time.  The timing is
> predictable enough that I don't think it is a purely stochastic race
> condition.

Hm. I've a bit of a hard time believing that such a timing could be
caused by the above patch. How confident that it's that patch, and not
just changed timing due to performance changes?  And you definitely can
only reproduce the problem with the regular crash cycles?


> It seems like some counter variable is overflowing.  But
> it is not the ShmemVariableCache->nextXid counter, as I previously
> speculated.  This test does not advance that fast enough to for it to
> wrap around within 9 hours of run time.  But I am at a loss of what
> other variable it might be. Since the system goes through a crash and
> recovery every few seconds, any backend-local counters or
> shared-memory counters would get reset upon recovery.  Right?

A lot of those counters will be re-set based on WAL contents. So if
they're corrupted once, several of them are prone to continue to be
corrupted.

Greetings,

Andres Freund



Re: atomic pin/unpin causing errors

From
Andres Freund
Date:
Hi Jeff,

On 2016-04-29 10:38:55 -0700, Jeff Janes wrote:
> I've bisected the errors I was seeing, discussed in
> http://www.postgresql.org/message-id/CAMkU=1xQEhC0Ok4d+tkjFQ1nvUhO37PYRKhJP6Q8oxifMx7OwA@mail.gmail.com
> 
> It look like they first appear in:
> 
> commit 48354581a49c30f5757c203415aa8412d85b0f70
> Author: Andres Freund <andres@anarazel.de>
> Date:   Sun Apr 10 20:12:32 2016 -0700
> 
>     Allow Pin/UnpinBuffer to operate in a lockfree manner.
> 
> 
> I get the errors:
> 
> ERROR:  attempted to delete invisible tuple
> STATEMENT:  update foo set count=count+1,text_array=$1 where text_array @> $2
> 
> And also:
> 
> ERROR:  unexpected chunk number 1 (expected 2) for toast value
> 85223889 in pg_toast_16424
> STATEMENT:  update foo set count=count+1 where text_array @> $1

Hm. I appear to have trouble reproducing this issue (continuing to try)
on master as of 8826d8507.  Is there any chance you could package up a
data directory after the issue hit?


> (This was all run using Teodor's test-enabling patch
> gin_alone_cleanup-4.patch, so as not to change horses in midstream.
> Now that a version of that patch has been committed, I will try to
> repeat this in HEAD)

Any news on that front?

Greetings,

Andres Freund



Re: atomic pin/unpin causing errors

From
Alexander Korotkov
Date:
On Wed, May 4, 2016 at 2:05 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-04-29 10:38:55 -0700, Jeff Janes wrote:
> I've bisected the errors I was seeing, discussed in
> http://www.postgresql.org/message-id/CAMkU=1xQEhC0Ok4d+tkjFQ1nvUhO37PYRKhJP6Q8oxifMx7OwA@mail.gmail.com
>
> It look like they first appear in:
>
> commit 48354581a49c30f5757c203415aa8412d85b0f70
> Author: Andres Freund <andres@anarazel.de>
> Date:   Sun Apr 10 20:12:32 2016 -0700
>
>     Allow Pin/UnpinBuffer to operate in a lockfree manner.
>
>
> I get the errors:
>
> ERROR:  attempted to delete invisible tuple
> STATEMENT:  update foo set count=count+1,text_array=$1 where text_array @> $2
>
> And also:
>
> ERROR:  unexpected chunk number 1 (expected 2) for toast value
> 85223889 in pg_toast_16424
> STATEMENT:  update foo set count=count+1 where text_array @> $1

Hm. I appear to have trouble reproducing this issue (continuing to try)
on master as of 8826d8507.  Is there any chance you could package up a
data directory after the issue hit?

FWIW, I'm also trying to reproduce it on big x86 machine on 9888b34f.
I'll write about results when get any.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: atomic pin/unpin causing errors

From
Teodor Sigaev
Date:
>> I get the errors:
>>
>> ERROR:  attempted to delete invisible tuple
>> STATEMENT:  update foo set count=count+1,text_array=$1 where text_array @> $2
>>
>> And also:
>>
>> ERROR:  unexpected chunk number 1 (expected 2) for toast value
>> 85223889 in pg_toast_16424
>> STATEMENT:  update foo set count=count+1 where text_array @> $1
>
> Hm. I appear to have trouble reproducing this issue (continuing to try)
> on master as of 8826d8507.  Is there any chance you could package up a
> data directory after the issue hit?

I've got
ERROR:  unexpected chunk number 0 (expected 1) for toast value 10192986 in 
pg_toast_16424

The test required 10 hours to run on my notebook. postgresql was compiled with 
-O0 --enable-debug --enable-cassert.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 



Re: atomic pin/unpin causing errors

From
Andres Freund
Date:
On 2016-05-04 18:12:45 +0300, Teodor Sigaev wrote:
> > > I get the errors:
> > > 
> > > ERROR:  attempted to delete invisible tuple
> > > STATEMENT:  update foo set count=count+1,text_array=$1 where text_array @> $2
> > > 
> > > And also:
> > > 
> > > ERROR:  unexpected chunk number 1 (expected 2) for toast value
> > > 85223889 in pg_toast_16424
> > > STATEMENT:  update foo set count=count+1 where text_array @> $1
> > 
> > Hm. I appear to have trouble reproducing this issue (continuing to try)
> > on master as of 8826d8507.  Is there any chance you could package up a
> > data directory after the issue hit?
> 
> I've got
> ERROR:  unexpected chunk number 0 (expected 1) for toast value 10192986 in
> pg_toast_16424
> 
> The test required 10 hours to run on my notebook. postgresql was compiled
> with -O0 --enable-debug --enable-cassert.

Interesting. I just ran a test for a good bit longer, till it failed due
to an nearing wraparound. Presumably because the crashes are too
frequent to finish vacuuming.

I did however, because Jeff said he coulnd't reproduce with cassert, use
an optimized build.  Wonder if there's some barrier related issue,
making this dependant on the compiler's exact code generation. That'd
explain why different people can reproduce it in different
circumstances.

Any chance you could package up that data directory for me to download?


Andres



Re: atomic pin/unpin causing errors

From
Jeff Janes
Date:
On Tue, May 3, 2016 at 4:05 PM, Andres Freund <andres@anarazel.de> wrote:
> Hi Jeff,
>
> On 2016-04-29 10:38:55 -0700, Jeff Janes wrote:
>> I've bisected the errors I was seeing, discussed in
>> http://www.postgresql.org/message-id/CAMkU=1xQEhC0Ok4d+tkjFQ1nvUhO37PYRKhJP6Q8oxifMx7OwA@mail.gmail.com
>>
>> It look like they first appear in:
>>
>> commit 48354581a49c30f5757c203415aa8412d85b0f70
>> Author: Andres Freund <andres@anarazel.de>
>> Date:   Sun Apr 10 20:12:32 2016 -0700
>>
>>     Allow Pin/UnpinBuffer to operate in a lockfree manner.
>>
>>
>> I get the errors:
>>
>> ERROR:  attempted to delete invisible tuple
>> STATEMENT:  update foo set count=count+1,text_array=$1 where text_array @> $2
>>
>> And also:
>>
>> ERROR:  unexpected chunk number 1 (expected 2) for toast value
>> 85223889 in pg_toast_16424
>> STATEMENT:  update foo set count=count+1 where text_array @> $1
>
> Hm. I appear to have trouble reproducing this issue (continuing to try)
> on master as of 8826d8507.  Is there any chance you could package up a
> data directory after the issue hit?

I'll look into.  I haven't been saving them, as they are very large
(tens of GB) by the time the errors happen.  In case I can't find a
way to transfer that much data, is there something I could do in situ
to debug it?

>
>
>> (This was all run using Teodor's test-enabling patch
>> gin_alone_cleanup-4.patch, so as not to change horses in midstream.
>> Now that a version of that patch has been committed, I will try to
>> repeat this in HEAD)
>
> Any news on that front?

I couldn't reproduce it in 82881b2b432c9433b45a (which is what HEAD
was at the time).

The last commit I saw the problem in was 8f1911d5e6d5, and in that
commit it took longer than usual to see the error, and I never saw at
all in one run (which lead me down the wrong path in git bisect) but
then saw errors upon another try.  Up until that commit, it seemed to
give the errors like clockwork, always after 8 to 10 hours of running.

I also have never seen the errors with the crashing turned off.  I
even tried it with autovac off and
autovacuum_freeze_max_age=1500000000 (to emulate the way vacuum never
gets a chance to run to completion in the crashing mode) and then I
don't get any errors up to the point I run out of disk space.

Cheers,

Jeff



Re: atomic pin/unpin causing errors

From
Andres Freund
Date:
Hi Jeff,

On 2016-05-04 14:00:01 -0700, Jeff Janes wrote:
> On Tue, May 3, 2016 at 4:05 PM, Andres Freund <andres@anarazel.de> wrote:
> > Hm. I appear to have trouble reproducing this issue (continuing to try)
> > on master as of 8826d8507.  Is there any chance you could package up a
> > data directory after the issue hit?
> 
> I'll look into.  I haven't been saving them, as they are very large
> (tens of GB) by the time the errors happen.

Hm. Any chance that's SSH accessible?


What compiler-version & OS are you using, with what exact
CFLAGS/configure input? I'd like to try to replicate the setup as close
as possible; in the hope of just making it locally reproducible.


> In case I can't find a way to transfer that much data, is there
> something I could do in situ to debug it?

Yes. It'd be good to get a look at the borked page/tuple with
pageinspect. That might require some manual search to find the affected
tuple, and possibly the problem is transient.  I was wondering whether
we could just put an Assert() into those error messages, to get a stack
dump. But unfortunately your tooling would likely generate far too many
of those.

Andres



Re: atomic pin/unpin causing errors

From
Teodor Sigaev
Date:
> Any chance you could package up that data directory for me to download?

Sent by personal email to Alexander, Andres and Jeff

In /var/log/message I found

May  4 22:04:07 xor kernel: pid 14010 (postgres), uid 1001: exited on signal 6 
(core dumped)
May  4 22:04:25 xor kernel: pid 14032 (postgres), uid 1001: exited on signal 11 
(core dumped)
May  4 22:04:52 xor kernel: pid 14037 (postgres), uid 1001: exited on signal 6 
(core dumped)


Sometimes postgres is crashed with SIGSEGV signal instead of SIGABRT (which 
comes form abort() in assert)

I'll try to get a coredump after SIGSEGV, but it could take a time.
-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 



Re: atomic pin/unpin causing errors

From
Teodor Sigaev
Date:
> I'll try to get a coredump after SIGSEGV, but it could take a time.

Got it!

#0  0x00000008014321d7 in sbrk () from /lib/libc.so.7
#1  0x0000000801431ddd in sbrk () from /lib/libc.so.7
#2  0x000000080142e5bb in sbrk () from /lib/libc.so.7
#3  0x000000080142e085 in sbrk () from /lib/libc.so.7
#4  0x000000080142de28 in sbrk () from /lib/libc.so.7
#5  0x000000080142e1cf in sbrk () from /lib/libc.so.7
#6  0x0000000801439815 in free () from /lib/libc.so.7
#7  0x000000080149e3d6 in nsdispatch () from /lib/libc.so.7
#8  0x00000008014a41c6 in __cxa_finalize () from /lib/libc.so.7
#9  0x000000080144525c in exit () from /lib/libc.so.7
#10 0x00000000008e1bc2 in quickdie (postgres_signal_arg=3) at postgres.c:2623
#11 <signal handler called>
#12 0x0000000801431847 in sbrk () from /lib/libc.so.7
#13 0x0000000801431522 in sbrk () from /lib/libc.so.7
#14 0x000000080142d47f in sbrk () from /lib/libc.so.7
#15 0x0000000801434628 in malloc () from /lib/libc.so.7
#16 0x0000000000aca278 in AllocSetAlloc (context=0x801c0bb88, size=24) at aset.c:853
#17 0x0000000000acca0e in MemoryContextAlloc (context=0x801c0bb88, size=24) at 
mcxt.c:764
#18 0x0000000000aebdb8 in PushActiveSnapshot (snap=0xf4ae10) at snapmgr.c:652
#19 0x00000000008e54bd in exec_bind_message (input_message=0x7fffffffdf60) at 
postgres.c:1602
#20 0x00000000008e3957 in PostgresMain (argc=1, argv=0x801d3c968, 
dbname=0x801d3c948 "teodor", username=0x801d3c928 "teodor") at postgres.c:4105
#21 0x0000000000839744 in BackendRun (port=0x801c991c0) at postmaster.c:4258
#22 0x0000000000838d54 in BackendStartup (port=0x801c991c0) at postmaster.c:3932
#23 0x0000000000835617 in ServerLoop () at postmaster.c:1690
#24 0x0000000000832c69 in PostmasterMain (argc=4, argv=0x7fffffffe420) at 
postmaster.c:1298
#25 0x000000000075f228 in main (argc=4, argv=0x7fffffffe420) at main.c:228

Seems, we have some memory corruption, but it could either separate or the same 
problem.

Another one:

#0  0x00000008014321d7 in sbrk () from /lib/libc.so.7
#1  0x0000000801431ddd in sbrk () from /lib/libc.so.7
#2  0x000000080142e5bb in sbrk () from /lib/libc.so.7
#3  0x000000080142e085 in sbrk () from /lib/libc.so.7
#4  0x000000080142de28 in sbrk () from /lib/libc.so.7
#5  0x000000080142e1cf in sbrk () from /lib/libc.so.7
#6  0x0000000801439815 in free () from /lib/libc.so.7
#7  0x000000080149e3d6 in nsdispatch () from /lib/libc.so.7
#8  0x00000008014a41c6 in __cxa_finalize () from /lib/libc.so.7
#9  0x000000080144525c in exit () from /lib/libc.so.7
#10 0x00000000008e1bc2 in quickdie (postgres_signal_arg=3) at postgres.c:2623
#11 <signal handler called>
#12 0x000000080143277a in sbrk () from /lib/libc.so.7
#13 0x00000008014318b5 in sbrk () from /lib/libc.so.7
#14 0x000000080142e483 in sbrk () from /lib/libc.so.7
#15 0x000000080142e75b in sbrk () from /lib/libc.so.7
#16 0x00000008014398bd in free () from /lib/libc.so.7
#17 0x0000000000aca676 in AllocSetFree (context=0x801e710d0, 
pointer=0x801e65038) at aset.c:976
#18 0x0000000000acbe93 in pfree (pointer=0x801e65038) at mcxt.c:1015
#19 0x00000000004a7986 in ginendscan (scan=0x801e61de0) at ginscan.c:445
#20 0x0000000000504818 in index_endscan (scan=0x801e61de0) at indexam.c:339
#21 0x0000000000719d21 in ExecEndBitmapIndexScan (node=0x801e619c8) at 
nodeBitmapIndexscan.c:183
#22 0x00000000006fce9e in ExecEndNode (node=0x801e619c8) at execProcnode.c:685
#23 0x0000000000719195 in ExecEndBitmapHeapScan (node=0x801d63700) at 
nodeBitmapHeapscan.c:508
#24 0x00000000006fceaf in ExecEndNode (node=0x801d63700) at execProcnode.c:689
#25 0x000000000072b64a in ExecEndModifyTable (node=0x801d632a0) at 
nodeModifyTable.c:1978
#26 0x00000000006fcde3 in ExecEndNode (node=0x801d632a0) at execProcnode.c:638
#27 0x00000000006f6ed9 in ExecEndPlan (planstate=0x801d632a0, 
estate=0x801d63038) at execMain.c:1451
#28 0x00000000006f6e56 in standard_ExecutorEnd (queryDesc=0x801e42af0) at 
execMain.c:468
#29 0x00000008020038f2 in pgss_ExecutorEnd (queryDesc=0x801e42af0) at 
pg_stat_statements.c:938
#30 0x00000000006f6d3c in ExecutorEnd (queryDesc=0x801e42af0) at execMain.c:437
#31 0x00000000008ea387 in ProcessQuery (plan=0x801e43898, sourceText=0x801e42838 
"update foo set count=count+1 where text_array @> $1", params=0x801e428b8, 
dest=0xf3fcc8,    completionTag=0x7fffffffdd00 "UPDATE 1") at pquery.c:230
#32 0x00000000008e9540 in PortalRunMulti (portal=0x801dc5038, isTopLevel=1 
'\001', dest=0xf3fcc8, altdest=0xf3fcc8, completionTag=0x7fffffffdd00 "UPDATE 
1") at pquery.c:1267
#33 0x00000000008e8cd6 in PortalRun (portal=0x801dc5038, 
count=9223372036854775807, isTopLevel=1 '\001', dest=0x801c96450, 
altdest=0x801c96450,    completionTag=0x7fffffffdd00 "UPDATE 1") at pquery.c:813
#34 0x00000000008e61ef in exec_execute_message (portal_name=0x801c96038 "", 
max_rows=9223372036854775807) at postgres.c:1979
#35 0x00000000008e39ae in PostgresMain (argc=1, argv=0x801d56bc8, 
dbname=0x801d56ba8 "teodor", username=0x801d56b88 "teodor") at postgres.c:4122
#36 0x0000000000839744 in BackendRun (port=0x801d571c0) at postmaster.c:4258
#37 0x0000000000838d54 in BackendStartup (port=0x801d571c0) at postmaster.c:3932
#38 0x0000000000835617 in ServerLoop () at postmaster.c:1690
#39 0x0000000000832c69 in PostmasterMain (argc=4, argv=0x7fffffffe420) at 
postmaster.c:1298
#40 0x000000000075f228 in main (argc=4, argv=0x7fffffffe420) at main.c:228


-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 



Re: quickdie doing memory allocations (was atomic pin/unpin causing errors)

From
Andres Freund
Date:
Hi Teodor,

Thanks for analyzing this.

On 2016-05-05 13:50:09 +0300, Teodor Sigaev wrote:
> > I'll try to get a coredump after SIGSEGV, but it could take a time.
> 
> Got it!
> 
> #0  0x00000008014321d7 in sbrk () from /lib/libc.so.7
> #1  0x0000000801431ddd in sbrk () from /lib/libc.so.7
> #2  0x000000080142e5bb in sbrk () from /lib/libc.so.7
> #3  0x000000080142e085 in sbrk () from /lib/libc.so.7
> #4  0x000000080142de28 in sbrk () from /lib/libc.so.7
> #5  0x000000080142e1cf in sbrk () from /lib/libc.so.7
> #6  0x0000000801439815 in free () from /lib/libc.so.7
> #7  0x000000080149e3d6 in nsdispatch () from /lib/libc.so.7
> #8  0x00000008014a41c6 in __cxa_finalize () from /lib/libc.so.7
> #9  0x000000080144525c in exit () from /lib/libc.so.7
> #10 0x00000000008e1bc2 in quickdie (postgres_signal_arg=3) at postgres.c:2623
> #11 <signal handler called>
> #12 0x0000000801431847 in sbrk () from /lib/libc.so.7
> #13 0x0000000801431522 in sbrk () from /lib/libc.so.7
> #14 0x000000080142d47f in sbrk () from /lib/libc.so.7
> #15 0x0000000801434628 in malloc () from /lib/libc.so.7
> #16 0x0000000000aca278 in AllocSetAlloc (context=0x801c0bb88, size=24) at aset.c:853
> #17 0x0000000000acca0e in MemoryContextAlloc (context=0x801c0bb88, size=24)
> at mcxt.c:764
> #18 0x0000000000aebdb8 in PushActiveSnapshot (snap=0xf4ae10) at snapmgr.c:652
> #19 0x00000000008e54bd in exec_bind_message (input_message=0x7fffffffdf60)
> at postgres.c:1602
> #20 0x00000000008e3957 in PostgresMain (argc=1, argv=0x801d3c968,
> dbname=0x801d3c948 "teodor", username=0x801d3c928 "teodor") at
> postgres.c:4105
> #21 0x0000000000839744 in BackendRun (port=0x801c991c0) at postmaster.c:4258
> #22 0x0000000000838d54 in BackendStartup (port=0x801c991c0) at postmaster.c:3932
> #23 0x0000000000835617 in ServerLoop () at postmaster.c:1690
> #24 0x0000000000832c69 in PostmasterMain (argc=4, argv=0x7fffffffe420) at
> postmaster.c:1298
> #25 0x000000000075f228 in main (argc=4, argv=0x7fffffffe420) at main.c:228
> 
> Seems, we have some memory corruption, but it could either separate or the
> same problem.

That looks like independent issue, namely that we're trigger memory
allocations from a signal handler (see frames 12, 11, 10, 9). Presumably
due to system registered atexit handlers.  I suspect we should be using
_exit() here?  Tom?

Greetings,

Andres Freund



Re: atomic pin/unpin causing errors

From
Andres Freund
Date:
Hi Jeff,

On 2016-04-29 10:38:55 -0700, Jeff Janes wrote:
> I don't see the problem with an cassert-enabled, probably because it
> is just too slow to ever reach the point where the problem occurs.

Running the test with cassert enabled I actually get assertion failures,
due to the FATAL you added.

#1  0x0000000000958dde in ExceptionalCondition (conditionName=0xb36c2a "!(RefCountErrors == 0)", errorType=0xb361af
"FailedAssertion",   fileName=0xb36170 "/home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c",
lineNumber=2506)at /home/admin/src/postgresql/src/backend/utils/error/assert.c:54
 
#2  0x00000000007c9fc9 in CheckForBufferLeaks () at
/home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2506
#3  0x00000000007c9f09 in AtProcExit_Buffers (code=1, arg=0) at
/home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2459
#4  0x00000000007d927f in shmem_exit (code=1) at /home/admin/src/postgresql/src/backend/storage/ipc/ipc.c:261
#5  0x00000000007d90dd in proc_exit_prepare (code=1) at /home/admin/src/postgresql/src/backend/storage/ipc/ipc.c:185
#6  0x00000000007d904b in proc_exit (code=1) at /home/admin/src/postgresql/src/backend/storage/ipc/ipc.c:102
#7  0x000000000095958d in errfinish (dummy=0) at /home/admin/src/postgresql/src/backend/utils/error/elog.c:543
#8  0x000000000080214b in mdwrite (reln=0x2e8b4a8, forknum=MAIN_FORKNUM, blocknum=154, buffer=0x2e8e5a8 "", skipFsync=0
'\000')  at /home/admin/src/postgresql/src/backend/storage/smgr/md.c:832
 
#9  0x0000000000804633 in smgrwrite (reln=0x2e8b4a8, forknum=MAIN_FORKNUM, blocknum=154, buffer=0x2e8e5a8 "",
skipFsync=0'\000')   at /home/admin/src/postgresql/src/backend/storage/smgr/smgr.c:650
 
#10 0x00000000007ca548 in FlushBuffer (buf=0x7f0285955330, reln=0x2e8b4a8) at
/home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2734
#11 0x00000000007c9d5a in SyncOneBuffer (buf_id=2503, skip_recently_used=0 '\000', wb_context=0x7ffe7305d290) at
/home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2377
#12 0x00000000007c964e in BufferSync (flags=64) at /home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:1967
#13 0x00000000007ca185 in CheckPointBuffers (flags=64) at
/home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2561
#14 0x000000000052d497 in CheckPointGuts (checkPointRedo=382762776, flags=64) at
/home/admin/src/postgresql/src/backend/access/transam/xlog.c:8644
#15 0x000000000052cede in CreateCheckPoint (flags=64) at
/home/admin/src/postgresql/src/backend/access/transam/xlog.c:8430
#16 0x00000000007706ac in CheckpointerMain () at /home/admin/src/postgresql/src/backend/postmaster/checkpointer.c:488
#17 0x000000000053e0d5 in AuxiliaryProcessMain (argc=2, argv=0x7ffe7305ea40) at
/home/admin/src/postgresql/src/backend/bootstrap/bootstrap.c:429
#18 0x000000000078099f in StartChildProcess (type=CheckpointerProcess) at
/home/admin/src/postgresql/src/backend/postmaster/postmaster.c:5227
#19 0x000000000077dcc3 in reaper (postgres_signal_arg=17) at
/home/admin/src/postgresql/src/backend/postmaster/postmaster.c:2781
#20 <signal handler called>
#21 0x00007f028ebbdac3 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:81
#22 0x000000000077c049 in ServerLoop () at /home/admin/src/postgresql/src/backend/postmaster/postmaster.c:1654
#23 0x000000000077b7a9 in PostmasterMain (argc=4, argv=0x2e49f20) at
/home/admin/src/postgresql/src/backend/postmaster/postmaster.c:1298
#24 0x00000000006c5849 in main (argc=4, argv=0x2e49f20) at /home/admin/src/postgresql/src/backend/main/main.c:228

You didn't see those?


The trigger here appears to be that the checkpointer doesn't have
on-exit callback similar to a normal backend's ShutdownPostgres() et al,
and thus doesn't trigger a resource owner release.  The normal ERROR
path has    /* buffer pins are released here: */    ResourceOwnerRelease(CurrentResourceOwner,
RESOURCE_RELEASE_BEFORE_LOCKS,                        false, true);    /* we needn't bother with the other
ResourceOwnerReleasephases */
 


That clearly is a bug. But I'm not immediately seing how this could
trigger the corruption issue you observed.

Greetings,

Andres Freund



Re: atomic pin/unpin causing errors

From
Andres Freund
Date:
On 2016-05-04 18:12:45 +0300, Teodor Sigaev wrote:
> > > I get the errors:
> > > 
> > > ERROR:  attempted to delete invisible tuple
> > > STATEMENT:  update foo set count=count+1,text_array=$1 where text_array @> $2
> > > 
> > > And also:
> > > 
> > > ERROR:  unexpected chunk number 1 (expected 2) for toast value
> > > 85223889 in pg_toast_16424
> > > STATEMENT:  update foo set count=count+1 where text_array @> $1
> > 
> > Hm. I appear to have trouble reproducing this issue (continuing to try)
> > on master as of 8826d8507.  Is there any chance you could package up a
> > data directory after the issue hit?
> 
> I've got
> ERROR:  unexpected chunk number 0 (expected 1) for toast value 10192986 in
> pg_toast_16424
> 
> The test required 10 hours to run on my notebook. postgresql was compiled
> with -O0 --enable-debug --enable-cassert.

Hm. And you're not seeing the asserts I reported in
http://archives.postgresql.org/message-id/20160505185246.2i7qftadwhzewykj%40alap3.anarazel.de
?

Greetings,

Andres Freund



Re: atomic pin/unpin causing errors

From
Teodor Sigaev
Date:
> Hm. And you're not seeing the asserts I reported in
> http://archives.postgresql.org/message-id/20160505185246.2i7qftadwhzewykj%40alap3.anarazel.de
> ?
>
I see it a lot, but I think that is a result of ereport(FATAL) after 
FileWrite(BLCKSZ/3) added by Jeff.

Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 



Andres Freund <andres@anarazel.de> writes:
>> #0  0x00000008014321d7 in sbrk () from /lib/libc.so.7
>> #1  0x0000000801431ddd in sbrk () from /lib/libc.so.7
>> #2  0x000000080142e5bb in sbrk () from /lib/libc.so.7
>> #3  0x000000080142e085 in sbrk () from /lib/libc.so.7
>> #4  0x000000080142de28 in sbrk () from /lib/libc.so.7
>> #5  0x000000080142e1cf in sbrk () from /lib/libc.so.7
>> #6  0x0000000801439815 in free () from /lib/libc.so.7
>> #7  0x000000080149e3d6 in nsdispatch () from /lib/libc.so.7
>> #8  0x00000008014a41c6 in __cxa_finalize () from /lib/libc.so.7
>> #9  0x000000080144525c in exit () from /lib/libc.so.7
>> #10 0x00000000008e1bc2 in quickdie (postgres_signal_arg=3) at postgres.c:2623
>> #11 <signal handler called>
>> #12 0x0000000801431847 in sbrk () from /lib/libc.so.7

> That looks like independent issue, namely that we're trigger memory
> allocations from a signal handler (see frames 12, 11, 10, 9). Presumably
> due to system registered atexit handlers.  I suspect we should be using
> _exit() here?  Tom?

I don't think that would improve matters.  In the first place, if we use
_exit() here that might encourage third-party extension authors to believe
they should use _exit(), which would be bad.  In the second place,
we don't know what it is we're skipping by not running atexit handlers,
and again that could be bad.  We don't like people trying to bypass our
on-exit code, why would anyone else?  In the third place, by the time we
get to the exit() call we've already exposed ourselves to a whole lot of
such hazards by running ereport() (including sending a message to the
client!).  In the fourth place, if we've received a quickdie interrupt,
it doesn't actually matter if the process crashes; we just want it to
quit ASAP.

So I'd say that this is just a cosmetic problem and that trying to fix
it is likely to make things worse.
        regards, tom lane



Re: quickdie doing memory allocations (was atomic pin/unpin causing errors)

From
Andres Freund
Date:
Hi,

On 2016-05-05 15:56:45 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> >> #0  0x00000008014321d7 in sbrk () from /lib/libc.so.7
> >> #1  0x0000000801431ddd in sbrk () from /lib/libc.so.7
> >> #2  0x000000080142e5bb in sbrk () from /lib/libc.so.7
> >> #3  0x000000080142e085 in sbrk () from /lib/libc.so.7
> >> #4  0x000000080142de28 in sbrk () from /lib/libc.so.7
> >> #5  0x000000080142e1cf in sbrk () from /lib/libc.so.7
> >> #6  0x0000000801439815 in free () from /lib/libc.so.7
> >> #7  0x000000080149e3d6 in nsdispatch () from /lib/libc.so.7
> >> #8  0x00000008014a41c6 in __cxa_finalize () from /lib/libc.so.7
> >> #9  0x000000080144525c in exit () from /lib/libc.so.7
> >> #10 0x00000000008e1bc2 in quickdie (postgres_signal_arg=3) at postgres.c:2623
> >> #11 <signal handler called>
> >> #12 0x0000000801431847 in sbrk () from /lib/libc.so.7
> 
> > That looks like independent issue, namely that we're trigger memory
> > allocations from a signal handler (see frames 12, 11, 10, 9). Presumably
> > due to system registered atexit handlers.  I suspect we should be using
> > _exit() here?  Tom?
> 
> I don't think that would improve matters.  In the first place, if we use
> _exit() here that might encourage third-party extension authors to believe
> they should use _exit(), which would be bad.

The sourcetree already has a number of _exit() calls, so I don't think
that'd make a meaningfull difference.


> In the second place, we don't know what it is we're skipping by not
> running atexit handlers, and again that could be bad.

I've a hard time coming up with a scenario where that'd be a problem in
a PANIC case. Isn't it pretty common to use _exit after fatal errors
(and forks)?


> In the third place, by the time we
> get to the exit() call we've already exposed ourselves to a whole lot of
> such hazards by running ereport() (including sending a message to the
> client!).

True. And that's not good. But the magic of ErrorContext shields us from
a fair amount of issues.


> In the fourth place, if we've received a quickdie interrupt,
> it doesn't actually matter if the process crashes; we just want it to
> quit ASAP.

If it always were crashing, that'd be somewhat fine. But sbrk internally
uses mutexes, so this can result in processes getting stuck. And that is
a problem.  There've actually been reports about that every now and then.

Greetings,

Andres Freund



On Thu, May 5, 2016 at 11:51 AM, Andres Freund <andres@anarazel.de> wrote:
>> #7  0x000000080149e3d6 in nsdispatch () from /lib/libc.so.7
>> #8  0x00000008014a41c6 in __cxa_finalize () from /lib/libc.so.7
>> #9  0x000000080144525c in exit () from /lib/libc.so.7
>> #10 0x00000000008e1bc2 in quickdie (postgres_signal_arg=3) at postgres.c:2623

Eh, this doesn't this __cxa_finalize() stuff suggest that some C++
code was linked into the backend?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: quickdie doing memory allocations (was atomic pin/unpin causing errors)

From
Andres Freund
Date:
On 2016-05-05 16:32:38 -0400, Robert Haas wrote:
> On Thu, May 5, 2016 at 11:51 AM, Andres Freund <andres@anarazel.de> wrote:
> >> #7  0x000000080149e3d6 in nsdispatch () from /lib/libc.so.7
> >> #8  0x00000008014a41c6 in __cxa_finalize () from /lib/libc.so.7
> >> #9  0x000000080144525c in exit () from /lib/libc.so.7
> >> #10 0x00000000008e1bc2 in quickdie (postgres_signal_arg=3) at postgres.c:2623
> 
> Eh, this doesn't this __cxa_finalize() stuff suggest that some C++
> code was linked into the backend?

IIRC __cxa_finalize also handles atexit() (and gcc
__attribute__((destructor))).

Greetings,

Andres Freund



Re: Missing error handling for FATALs in checkpointer/bgwriter

From
Andres Freund
Date:
On 2016-05-05 11:52:46 -0700, Andres Freund wrote:
> Hi Jeff,
> 
> On 2016-04-29 10:38:55 -0700, Jeff Janes wrote:
> > I don't see the problem with an cassert-enabled, probably because it
> > is just too slow to ever reach the point where the problem occurs.
> 
> Running the test with cassert enabled I actually get assertion failures,
> due to the FATAL you added.
> 
> #1  0x0000000000958dde in ExceptionalCondition (conditionName=0xb36c2a "!(RefCountErrors == 0)", errorType=0xb361af
"FailedAssertion",
 
>     fileName=0xb36170 "/home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c", lineNumber=2506) at
/home/admin/src/postgresql/src/backend/utils/error/assert.c:54
> #2  0x00000000007c9fc9 in CheckForBufferLeaks () at
/home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2506
> #3  0x00000000007c9f09 in AtProcExit_Buffers (code=1, arg=0) at
/home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2459
> #4  0x00000000007d927f in shmem_exit (code=1) at /home/admin/src/postgresql/src/backend/storage/ipc/ipc.c:261
> #5  0x00000000007d90dd in proc_exit_prepare (code=1) at /home/admin/src/postgresql/src/backend/storage/ipc/ipc.c:185
> #6  0x00000000007d904b in proc_exit (code=1) at /home/admin/src/postgresql/src/backend/storage/ipc/ipc.c:102
> #7  0x000000000095958d in errfinish (dummy=0) at /home/admin/src/postgresql/src/backend/utils/error/elog.c:543
> #8  0x000000000080214b in mdwrite (reln=0x2e8b4a8, forknum=MAIN_FORKNUM, blocknum=154, buffer=0x2e8e5a8 "",
skipFsync=0'\000')
 
>     at /home/admin/src/postgresql/src/backend/storage/smgr/md.c:832
> #9  0x0000000000804633 in smgrwrite (reln=0x2e8b4a8, forknum=MAIN_FORKNUM, blocknum=154, buffer=0x2e8e5a8 "",
skipFsync=0'\000')
 
>     at /home/admin/src/postgresql/src/backend/storage/smgr/smgr.c:650
> #10 0x00000000007ca548 in FlushBuffer (buf=0x7f0285955330, reln=0x2e8b4a8) at
/home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2734
> #11 0x00000000007c9d5a in SyncOneBuffer (buf_id=2503, skip_recently_used=0 '\000', wb_context=0x7ffe7305d290) at
/home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2377
> #12 0x00000000007c964e in BufferSync (flags=64) at
/home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:1967
> #13 0x00000000007ca185 in CheckPointBuffers (flags=64) at
/home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2561
> #14 0x000000000052d497 in CheckPointGuts (checkPointRedo=382762776, flags=64) at
/home/admin/src/postgresql/src/backend/access/transam/xlog.c:8644
> #15 0x000000000052cede in CreateCheckPoint (flags=64) at
/home/admin/src/postgresql/src/backend/access/transam/xlog.c:8430
> #16 0x00000000007706ac in CheckpointerMain () at
/home/admin/src/postgresql/src/backend/postmaster/checkpointer.c:488
> #17 0x000000000053e0d5 in AuxiliaryProcessMain (argc=2, argv=0x7ffe7305ea40) at
/home/admin/src/postgresql/src/backend/bootstrap/bootstrap.c:429
> #18 0x000000000078099f in StartChildProcess (type=CheckpointerProcess) at
/home/admin/src/postgresql/src/backend/postmaster/postmaster.c:5227
> #19 0x000000000077dcc3 in reaper (postgres_signal_arg=17) at
/home/admin/src/postgresql/src/backend/postmaster/postmaster.c:2781
> #20 <signal handler called>
> #21 0x00007f028ebbdac3 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:81
> #22 0x000000000077c049 in ServerLoop () at /home/admin/src/postgresql/src/backend/postmaster/postmaster.c:1654
> #23 0x000000000077b7a9 in PostmasterMain (argc=4, argv=0x2e49f20) at
/home/admin/src/postgresql/src/backend/postmaster/postmaster.c:1298
> #24 0x00000000006c5849 in main (argc=4, argv=0x2e49f20) at /home/admin/src/postgresql/src/backend/main/main.c:228
> 
> You didn't see those?
> 
> 
> The trigger here appears to be that the checkpointer doesn't have
> on-exit callback similar to a normal backend's ShutdownPostgres() et al,
> and thus doesn't trigger a resource owner release.  The normal ERROR
> path has
>         /* buffer pins are released here: */
>         ResourceOwnerRelease(CurrentResourceOwner,
>                              RESOURCE_RELEASE_BEFORE_LOCKS,
>                              false, true);
>         /* we needn't bother with the other ResourceOwnerRelease phases */
> 
> That clearly is a bug. But I'm not immediately seing how this could
> trigger the corruption issue you observed.


The same issue exists in bgwriter afaics. ISTM that we need to provide
an before_shmem_exit (or on_shmem_exit?) handler for both which essentially does/* * These operations are really just a
minimalsubset of * AbortTransaction().  We don't have very many resources to worry * about in bgwriter, but we do have
LWLocks,buffers, and temp files. */LWLockReleaseAll();AbortBufferIO();UnlockBuffers();/* buffer pins are released here:
*/ResourceOwnerRelease(CurrentResourceOwner,                    RESOURCE_RELEASE_BEFORE_LOCKS,
false,true);
 
it looks to me like that should be backpatched?

There's some question about how to make the ordering
vs. AtProcExit_Buffers robust; which is why I'm above explicitly doing
LWLockReleaseAll/AbortBufferIO/UnlockBuffers.

Any better ideas?



Re: atomic pin/unpin causing errors

From
Jeff Janes
Date:
On Thu, May 5, 2016 at 11:52 AM, Andres Freund <andres@anarazel.de> wrote:
> Hi Jeff,
>
> On 2016-04-29 10:38:55 -0700, Jeff Janes wrote:
>> I don't see the problem with an cassert-enabled, probably because it
>> is just too slow to ever reach the point where the problem occurs.
>
> Running the test with cassert enabled I actually get assertion failures,
> due to the FATAL you added.
>
> #1  0x0000000000958dde in ExceptionalCondition (conditionName=0xb36c2a "!(RefCountErrors == 0)", errorType=0xb361af
"FailedAssertion",
>     fileName=0xb36170 "/home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c", lineNumber=2506) at
/home/admin/src/postgresql/src/backend/utils/error/assert.c:54
> #2  0x00000000007c9fc9 in CheckForBufferLeaks () at
/home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2506
...
>
> You didn't see those?

Yes, I have been seeing those on assert-enabled builds going back as
far as I can remember (long before this particular problem started
showing up).  I just assumed it was a natural consequence of throwing
an ERROR from inside a critical section.  I never really understood
it, why would a panicking process bother to check for buffer leaks in
the first place?  It is leaking everything, which is why the entire
system has to be brought down immediately.

I have been trying (and failing) to reproduce the problem in more
recent releases, with and without cassert.  Here is pg_config output
of one of my current attempts:

BINDIR = /home/jjanes/pgsql/torn_bisect/bin
DOCDIR = /home/jjanes/pgsql/torn_bisect/share/doc
HTMLDIR = /home/jjanes/pgsql/torn_bisect/share/doc
INCLUDEDIR = /home/jjanes/pgsql/torn_bisect/include
PKGINCLUDEDIR = /home/jjanes/pgsql/torn_bisect/include
INCLUDEDIR-SERVER = /home/jjanes/pgsql/torn_bisect/include/server
LIBDIR = /home/jjanes/pgsql/torn_bisect/lib
PKGLIBDIR = /home/jjanes/pgsql/torn_bisect/lib
LOCALEDIR = /home/jjanes/pgsql/torn_bisect/share/locale
MANDIR = /home/jjanes/pgsql/torn_bisect/share/man
SHAREDIR = /home/jjanes/pgsql/torn_bisect/share
SYSCONFDIR = /home/jjanes/pgsql/torn_bisect/etc
PGXS = /home/jjanes/pgsql/torn_bisect/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = 'CFLAGS=-ggdb' '--with-extra-version=-c1543a8'
'--enable-debug' '--with-libxml' '--with-perl' '--with-python'
'--with-ldap' '--with-openssl' '--with-gssapi' '--enable-cassert'
'--prefix=/home/jjanes/pgsql/torn_bisect/'
CC = gcc
CPPFLAGS = -DFRONTEND -D_GNU_SOURCE -I/usr/include/libxml2
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -g -ggdb
CFLAGS_SL = -fpic
LDFLAGS = -L../../src/common -Wl,--as-needed
-Wl,-rpath,'/home/jjanes/pgsql/torn_bisect/lib',--enable-new-dtags
LDFLAGS_EX =
LDFLAGS_SL =
LIBS = -lpgcommon -lpgport -lxml2 -lssl -lcrypto -lgssapi_krb5 -lz
-lreadline -lrt -lcrypt -ldl -lm
VERSION = PostgreSQL 9.6devel-c1543a8

The only difference between this and the ones that did find the ERR
would be toggling --enable-cassert and changing which git commit was
used (and manually applying the gin_alone patch when testing commits
that precede that one's committal.

Linux:  2.6.32-573.22.1.el6.x86_64 #1 SMP Wed Mar 23 03:35:39 UTC 2016
x86_64 x86_64 x86_64 GNU/Linux

/proc/cpu_info:

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 62
model name      : Intel(R) Xeon(R) CPU E5-4640 v2 @ 2.20GHz
stepping        : 4
microcode       : 4294967295
cpu MHz         : 2199.933
cache size      : 20480 KB
physical id     : 0
siblings        : 8
core id         : 7
cpu cores       : 8
apicid          : 7
initial apicid  : 7
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx lm
constant_tsc rep_good unfair_spinlock pni pclmulqdq ssse3 cx16 sse4_1
sse4_2 popcnt aes xsave avx f16c rdrand hypervisor lahf_lm xsaveopt
fsgsbase smep erms
bogomips        : 4399.86
clflush size    : 64
cache_alignment : 64
address sizes   : 42 bits physical, 48 bits virtual
power management:

Cheers,

Jeff



Re: atomic pin/unpin causing errors

From
Andres Freund
Date:
On 2016-05-06 11:15:03 -0700, Jeff Janes wrote:
> > Running the test with cassert enabled I actually get assertion failures,
> > due to the FATAL you added.
> >
> > #1  0x0000000000958dde in ExceptionalCondition (conditionName=0xb36c2a "!(RefCountErrors == 0)", errorType=0xb361af
"FailedAssertion",
> >     fileName=0xb36170 "/home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c", lineNumber=2506) at
/home/admin/src/postgresql/src/backend/utils/error/assert.c:54
> > #2  0x00000000007c9fc9 in CheckForBufferLeaks () at
/home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2506
> ...
> >
> > You didn't see those?
> 
> Yes, I have been seeing those on assert-enabled builds going back as
> far as I can remember (long before this particular problem started
> showing up). > \I just assumed it was a natural consequence of throwing
> an ERROR from inside a critical section.

There's no critical section here...  The reason we're restarting is
because the FATAL makes the checkpointer exit, which postmaster treats
as a cause to trigger a PANIC:            /*             * Any unexpected exit of the checkpointer (including FATAL
       * exit) is treated as a crash.             */            HandleChildCrash(pid, exitstatus,
     _("checkpointer process"));
 


> I never really understood
> it, why would a panicking process bother to check for buffer leaks in
> the first place?  It is leaking everything, which is why the entire
> system has to be brought down immediately.

Panicing ones don't...


> I have been trying (and failing) to reproduce the problem in more
> recent releases, with and without cassert.  Here is pg_config output
> of one of my current attempts:

If you say "recent releases" you mean that you've not been able to
reproduce it in 9.5, 9.4, ..., not that the issue has vanished in
master, right?


Greetings,

Andres Freund



Re: atomic pin/unpin causing errors

From
Jeff Janes
Date:
On Fri, May 6, 2016 at 11:24 AM, Andres Freund <andres@anarazel.de> wrote:
>
>> I have been trying (and failing) to reproduce the problem in more
>> recent releases, with and without cassert.  Here is pg_config output
>> of one of my current attempts:
>
> If you say "recent releases" you mean that you've not been able to
> reproduce it in 9.5, 9.4, ..., not that the issue has vanished in
> master, right?

Sorry, I meant recent commits, not releases.  I have not been able to
reproduce it in master (anything after 8f1911d5e6), but I also can't
find the commit which fixed it, because I don't know how many hours of
running without an error is enough to declare it good.  I've already
gotten burned by that by declaring 8f1911d5e6 good once, only to find
out it was still bad just with a lower probability of finding it.

I also can't reproduce it in 9.5, but I wouldn't expect it to given
the large changes in the way GIN indexes operates between 9.5 and 9.6,
which completely changes the timing and types of races (even ones
outside of the GIN code) one might expect to see.  I'm pretty sure I
would have caught it during 9.5's beta if it were findable with this
test, as this test was run back then in pretty close to the current
form.  I'm not saying their isn't a bug in 9.5, just that if there is
this test is not efficient at invoking it.

So the issue *has* vanished in master, but without knowing what fixed
it I have no confidence it was actually fixed, rather than the test
just stopped being effective.

What I plan on doing now is going back to the part of the commit
history where I could reproduce it reliably, and start throwing
unnecessary things of the ./configure and the postgresql.conf to see
what the minimal configuration is at which I can reproduce it.

Sorry for the confusion.

Cheers,

Jeff



Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

From
Andres Freund
Date:
Hi,

On 2016-04-29 10:38:55 -0700, Jeff Janes wrote:
> I've bisected the errors I was seeing, discussed in
> http://www.postgresql.org/message-id/CAMkU=1xQEhC0Ok4d+tkjFQ1nvUhO37PYRKhJP6Q8oxifMx7OwA@mail.gmail.com
>
> It look like they first appear in:
>
> commit 48354581a49c30f5757c203415aa8412d85b0f70
> Author: Andres Freund <andres@anarazel.de>
> Date:   Sun Apr 10 20:12:32 2016 -0700
>
>     Allow Pin/UnpinBuffer to operate in a lockfree manner.
>
>
> I get the errors:
>
> ERROR:  attempted to delete invisible tuple
> STATEMENT:  update foo set count=count+1,text_array=$1 where text_array @> $2
>
> And also:
>
> ERROR:  unexpected chunk number 1 (expected 2) for toast value
> 85223889 in pg_toast_16424
> STATEMENT:  update foo set count=count+1 where text_array @> $1

I'm a bit dumbfounded here. I think I found the issue, and, if so, it's
*very* longstanding. HeapTupleSatisfiesToast(), aborted transactions,
and hint bits (or oid wraparound) appear to be a broken combination.

What I'm seing is that tuptoaster.c is trying to fetch or delete toast
chunks generated in an aborted (via crash) transaction. And that, with
the same valueid/chunk_id, a valid value exist somewhere else in the
toast table.

HeapTupleSatisfiesToast() does not do any actual tuple liveliness
detection besides checking basic hint bits. Thus, if there's a dead
tuple in the toast table's heap, which a *live* reference from the
table, toast_fetch_datum() potentially returns the wrong results.

That makes the new question: How can there be a live reference to a dead
and reused toast id?  As far as I can see the primary problem is that
GetNewOidWithIndex() uses SnapshotDirty to verify whether a tuple is
live.  As that actually verifies tuple liveliness (and sets hint bits),
it'll skip over an aborted toasted datum.  A valid question is why the
hint bits set by HeapTupleSatisfiesDirty() doesn't prevent that from
occuring (besides standbys which aren't evolved): It's the constant
crashes.  Looking at a tuple which triggers the "unexpected chunk
number" error, it indeed has a live and a dead version of the same chunk
id.

Interestingly the dead rows looks like:
chunk_id = 1 - t_infomask: 0xA02 - HEAP_XMAX_INVALID | HEAP_XMIN_INVALID | HEAP_HASVARWIDTH
chunk_id = 2 - t_infomask: 0x802 - HEAP_XMAX_INVALID | HEAP_HASVARWIDTH

My rather strong suspicion on why that is, is that both tuples were
visited by SnapshotDirty(), but only the first ones hint bits got
persisted, because checksums were enabled. The second time through
MarkBufferDirtyHint skipped work, because the buffer was already dirty.

rmgr: XLOG        len (rec/tot):      0/  3809, tx:  206127725, lsn: 1C/EB3B0910, prev 1C/EB3B08E0, desc: FPI_FOR_HINT
,blkref #0: rel 1663/16384/16433 blk 202567 FPW
 
that's indeed the only time the affected page is touched as visible to
pg_xlogdump after being filled...


The easy way to trigger this problem would be to have an oid wraparound
- but the WAL shows that that's not the case here.  I've not figured
that one out entirely (and won't tonight). But I do see WAL records
like:
rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID
4302693
rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID
4302693
i.e. two NEXTOID records allocating the same range, which obviously
doesn't seem right.  There's also every now and then close by ranges:
rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 1/9A404DB8, prev 1/9A404270, desc: NEXTOID
3311455
rmgr: XLOG        len (rec/tot):      4/    30, tx:    7814505, lsn: 1/9A4EC888, prev 1/9A4EB9D0, desc: NEXTOID
3311461


As far as I can see something like the above, or an oid wraparound, are
pretty much deadly for toast.

Is anybody ready with a good defense for SatisfiesToast not doing any
actual liveliness checks?


Andres



On 10 May 2016 at 09:05, Andres Freund <andres@anarazel.de> wrote:
 
Is anybody ready with a good defense for SatisfiesToast not doing any
actual liveliness checks?

I provided a patch earlier that rechecks the OID fetched from a toast chunk matches the OID requested.

I didn't commit it, I just used it to check the patch which changed btree vacuum replay. 

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Tue, May 10, 2016 at 3:05 AM, Andres Freund <andres@anarazel.de> wrote:
> The easy way to trigger this problem would be to have an oid wraparound
> - but the WAL shows that that's not the case here.  I've not figured
> that one out entirely (and won't tonight). But I do see WAL records
> like:
> rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID
4302693
> rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID
4302693
> i.e. two NEXTOID records allocating the same range, which obviously
> doesn't seem right.  There's also every now and then close by ranges:
> rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 1/9A404DB8, prev 1/9A404270, desc: NEXTOID
3311455
> rmgr: XLOG        len (rec/tot):      4/    30, tx:    7814505, lsn: 1/9A4EC888, prev 1/9A4EB9D0, desc: NEXTOID
3311461
>
>
> As far as I can see something like the above, or an oid wraparound, are
> pretty much deadly for toast.
>
> Is anybody ready with a good defense for SatisfiesToast not doing any
> actual liveliness checks?

I assume that this was installed as a performance optimization, and I
don't really see why it shouldn't be or be able to be made safe.  I
assume that the wraparound case was deemed safe because at that time
the idea of 4 billion OIDs getting used with old transactions still
active seemed inconceivable.  It seems to me that the real question
here is how you're getting two calls to XLogPutNextOid() with the same
value of ShmemVariableCache->nextOid, and the answer, as it seems to
me, must be that LWLocks are broken.  Either two processes are
managing to hold OidGenLock in exclusive mode at the same time, or
they're acquiring it in quick succession but without the second
process seeing all of the updates performed by the first process.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Andres Freund <andres@anarazel.de> writes:
> Is anybody ready with a good defense for SatisfiesToast not doing any
> actual liveliness checks?

As long as we do not update toast values after creation, there is no
need; the liveness check on the parent tuple is what's important.
Adding a liveness check on the toast item would merely create new
failure modes with no corresponding benefit.  Imagine deciding
that field 3 of a regular tuple was somehow dead even though the
rest of the tuple is live --- how can that be good?

I concur with Robert that what this looks like is failure to ensure
that toast OIDs are unique, which is an entirely different problem.
        regards, tom lane



Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

From
Andres Freund
Date:
On 2016-05-10 08:09:02 -0400, Robert Haas wrote:
> On Tue, May 10, 2016 at 3:05 AM, Andres Freund <andres@anarazel.de> wrote:
> > The easy way to trigger this problem would be to have an oid wraparound
> > - but the WAL shows that that's not the case here.  I've not figured
> > that one out entirely (and won't tonight). But I do see WAL records
> > like:
> > rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID
4302693
> > rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID
4302693
> > i.e. two NEXTOID records allocating the same range, which obviously
> > doesn't seem right.  There's also every now and then close by ranges:
> > rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 1/9A404DB8, prev 1/9A404270, desc: NEXTOID
3311455
> > rmgr: XLOG        len (rec/tot):      4/    30, tx:    7814505, lsn: 1/9A4EC888, prev 1/9A4EB9D0, desc: NEXTOID
3311461
> >
> >
> > As far as I can see something like the above, or an oid wraparound, are
> > pretty much deadly for toast.
> >
> > Is anybody ready with a good defense for SatisfiesToast not doing any
> > actual liveliness checks?
> 
> I assume that this was installed as a performance optimization, and I
> don't really see why it shouldn't be or be able to be made safe.  I
> assume that the wraparound case was deemed safe because at that time
> the idea of 4 billion OIDs getting used with old transactions still
> active seemed inconceivable.

It's not super likely, yea. But you don't really need to "use" 4 billion
oids to get a wraparound. Once you have a significant number of values
in various toast tables, the oid counter progresses really rather fast,
without many writes. That's because the oid counter is global, but each
individual toast write (and other things), perform checks via
GetNewOidWithIndex().

I'm not sure why you think it's safe? Consider the following scenario:

BEGIN;
-- nextoid: 1
INSERT toastval = chunk_id = 1;
ROLLBACK:
...
-- oid counter wraps around
-- nextoid: 1
INSERT toastval = chunk_id = 1;

-- crash, loosing all pending hint bits

SELECT toastval;

The last SELECT might find either of the toasted data chunks, depending
on heap ordering. As they're not hinted as invalid due to the crash,
HeapTupleSatisfiesToast() will return both as visible.


To make that safe we'd at least make hint bit writes by the scan in
GetNewOidWithIndex() durable, and likely also disable the killtuples
optimization; to avoid a plain SELECT of the toast table to make some
tuples unreachable, but not durably hinted.


That seems fairly fragile. I've a significant amount of doubt that toast
reads are bottlenecked by visibility routines.


> It seems to me that the real question
> here is how you're getting two calls to XLogPutNextOid() with the same
> value of ShmemVariableCache->nextOid, and the answer, as it seems to
> me, must be that LWLocks are broken.

There likely were a bunch of crashes in between, Jeff's test suite
triggers them at a high rate. It seems a lot more likely than that an
lwlock bug only materializes in the oid counter.  Investigating.

Greetings,

Andres Freund



Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

From
Andres Freund
Date:
On 2016-05-10 12:28:57 +0200, Simon Riggs wrote:
> On 10 May 2016 at 09:05, Andres Freund <andres@anarazel.de> wrote:
> 
> 
> > Is anybody ready with a good defense for SatisfiesToast not doing any
> > actual liveliness checks?
> >
> 
> I provided a patch earlier that rechecks the OID fetched from a toast chunk
> matches the OID requested.

They match in this case, so that's not likely to help with the issue at
hand?

- Andres



Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

From
Andres Freund
Date:
On 2016-05-10 10:09:07 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Is anybody ready with a good defense for SatisfiesToast not doing any
> > actual liveliness checks?
> 
> As long as we do not update toast values after creation, there is no
> need; the liveness check on the parent tuple is what's important.
> Adding a liveness check on the toast item would merely create new
> failure modes with no corresponding benefit.

You mean besides not having a corrupted database like in this case?


> Imagine deciding that field 3 of a regular tuple was somehow dead even
> though the rest of the tuple is live --- how can that be good?

How would ever happen in normal operation? And why are we checking hint
bits in HeapTupleSatisfiesToast() in that case?if (!HeapTupleHeaderXminCommitted(tuple)){    if
(HeapTupleHeaderXminInvalid(tuple))       return false;
 

The only way I can see the scenario you're described happening is if
there's a bug somewhere else.


> I concur with Robert that what this looks like is failure to ensure
> that toast OIDs are unique, which is an entirely different problem.

Well, I did describe how this can come about due to a wraparound, and
how GetNewOidWithIndex isn't sufficient to prevent the problem, as it
only hints (which are by definition not durable)...

Greetings,

Andres Freund



On Tue, May 10, 2016 at 12:19 PM, Andres Freund <andres@anarazel.de> wrote:
>> I assume that this was installed as a performance optimization, and I
>> don't really see why it shouldn't be or be able to be made safe.  I
>> assume that the wraparound case was deemed safe because at that time
>> the idea of 4 billion OIDs getting used with old transactions still
>> active seemed inconceivable.
>
> It's not super likely, yea. But you don't really need to "use" 4 billion
> oids to get a wraparound. Once you have a significant number of values
> in various toast tables, the oid counter progresses really rather fast,
> without many writes. That's because the oid counter is global, but each
> individual toast write (and other things), perform checks via
> GetNewOidWithIndex().

Understood.

> I'm not sure why you think it's safe? Consider the following scenario:
>
> BEGIN;
> -- nextoid: 1
> INSERT toastval = chunk_id = 1;
> ROLLBACK:
> ...
> -- oid counter wraps around
> -- nextoid: 1
> INSERT toastval = chunk_id = 1;

Uh oh.  That's really bad.  As long as we vacuum the table every 4
billion OID counter uses, and no long-running transactions prevent us
from doing cleanup, there will be no issue of this sort, but there's
no guarantee of those things being true on modern hardware.  And I
doubt we want to have a relmindeadoid to go with relfrozenxid and
relminmxid.  Our last round of convincing VACUUM to take into account
one more kind of wraparound prevention wasn't loads of fun.   Not to
mention the fact that this could wrap FAR faster.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, May 10, 2016 at 12:19 PM, Andres Freund <andres@anarazel.de> wrote:
>> It's not super likely, yea. But you don't really need to "use" 4 billion
>> oids to get a wraparound. Once you have a significant number of values
>> in various toast tables, the oid counter progresses really rather fast,
>> without many writes. That's because the oid counter is global, but each
>> individual toast write (and other things), perform checks via
>> GetNewOidWithIndex().

> Understood.

Sooner or later we are going to need to go to 8-byte TOAST object
identifiers.  Maybe we should think about doing that sooner not later
rather than trying to invent some anti-wraparound solution here.

In principle, you could support existing TOAST tables and pointers
containing 4-byte IDs in parallel with the new ones.  Not sure how
pg_upgrade would handle it exactly though.
        regards, tom lane



Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Tue, May 10, 2016 at 12:19 PM, Andres Freund <andres@anarazel.de> wrote:
> >> It's not super likely, yea. But you don't really need to "use" 4 billion
> >> oids to get a wraparound. Once you have a significant number of values
> >> in various toast tables, the oid counter progresses really rather fast,
> >> without many writes. That's because the oid counter is global, but each
> >> individual toast write (and other things), perform checks via
> >> GetNewOidWithIndex().
> 
> > Understood.
> 
> Sooner or later we are going to need to go to 8-byte TOAST object
> identifiers.  Maybe we should think about doing that sooner not later
> rather than trying to invent some anti-wraparound solution here.

Umm, it seems to me like we need this fixed in supported branches, not
just 9.7, so I don't think 8-byte toast IDs are a reasonable solution at
this point.

> In principle, you could support existing TOAST tables and pointers
> containing 4-byte IDs in parallel with the new ones.  Not sure how
> pg_upgrade would handle it exactly though.

I suppose the real problem is that there's no way to have a mix of 4-
and 8-byte identifiers in the same toast table.  I suppose we could have
two toast tables for each regular table, but that sounds pretty painful.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On 10 May 2016 at 17:20, Andres Freund <andres@anarazel.de> wrote:
On 2016-05-10 12:28:57 +0200, Simon Riggs wrote:
> On 10 May 2016 at 09:05, Andres Freund <andres@anarazel.de> wrote:
>
>
> > Is anybody ready with a good defense for SatisfiesToast not doing any
> > actual liveliness checks?
> >
>
> I provided a patch earlier that rechecks the OID fetched from a toast chunk
> matches the OID requested.

They match in this case, so that's not likely to help with the issue at
hand?

I can see, but I was answering your more general question above with what I had available.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Sooner or later we are going to need to go to 8-byte TOAST object
>> identifiers.  Maybe we should think about doing that sooner not later
>> rather than trying to invent some anti-wraparound solution here.

> Umm, it seems to me like we need this fixed in supported branches, not
> just 9.7, so I don't think 8-byte toast IDs are a reasonable solution at
> this point.

I think the problem is pretty hypothetical until you get to consuming a
substantial part of the OID space within any one toast table, at which
point you're going to need 8-byte toast OIDs.  Improving that situation
seems like something we can define as a future feature.

(Note that I do not believe for a moment that Andres has actually
exhibited such a problem in his test case.  He'd have needed several
TB worth of TOAST space to get to that point.)

>> In principle, you could support existing TOAST tables and pointers
>> containing 4-byte IDs in parallel with the new ones.  Not sure how
>> pg_upgrade would handle it exactly though.

> I suppose the real problem is that there's no way to have a mix of 4-
> and 8-byte identifiers in the same toast table.  I suppose we could have
> two toast tables for each regular table, but that sounds pretty painful.

Hmm.  It's not impossible by any means, since you could tell by looking
at a toast OID which table it was in.  And that way might provide a less
painful upgrade process for an existing table that grew to the point of
needing bigger OIDs.
        regards, tom lane



On Tue, May 10, 2016 at 9:19 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-05-10 08:09:02 -0400, Robert Haas wrote:
>> On Tue, May 10, 2016 at 3:05 AM, Andres Freund <andres@anarazel.de> wrote:
>> > The easy way to trigger this problem would be to have an oid wraparound
>> > - but the WAL shows that that's not the case here.  I've not figured
>> > that one out entirely (and won't tonight). But I do see WAL records
>> > like:
>> > rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID
4302693
>> > rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID
4302693

Were there any CHECKPOINT_SHUTDOWN records, or any other NEXTOID
records, between those two records you show?


My current test harness updates the scalar count field on every
iteration, but changes the (probably toasted) text_array field with a
probability of only 1% each time.  Perhaps making that more likely (by
changing line 186 of count.pl) would make it easier to trigger the
bug.  I'll try that in my next iteration of tests.

Cheers,

Jeff



Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

From
Andres Freund
Date:
On 2016-05-10 09:19:16 -0700, Andres Freund wrote:
> On 2016-05-10 08:09:02 -0400, Robert Haas wrote:
> > On Tue, May 10, 2016 at 3:05 AM, Andres Freund <andres@anarazel.de> wrote:
> > > The easy way to trigger this problem would be to have an oid wraparound
> > > - but the WAL shows that that's not the case here.  I've not figured
> > > that one out entirely (and won't tonight). But I do see WAL records
> > > like:
> > > rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID
4302693
> > > rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID
4302693
> > > i.e. two NEXTOID records allocating the same range, which obviously
> > > doesn't seem right.  There's also every now and then close by ranges:
> > > rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 1/9A404DB8, prev 1/9A404270, desc: NEXTOID
3311455
> > > rmgr: XLOG        len (rec/tot):      4/    30, tx:    7814505, lsn: 1/9A4EC888, prev 1/9A4EB9D0, desc: NEXTOID
3311461

> > It seems to me that the real question
> > here is how you're getting two calls to XLogPutNextOid() with the same
> > value of ShmemVariableCache->nextOid, and the answer, as it seems to
> > me, must be that LWLocks are broken.
> 
> There likely were a bunch of crashes in between, Jeff's test suite
> triggers them at a high rate. It seems a lot more likely than that an
> lwlock bug only materializes in the oid counter.  Investigating.

void
CreateCheckPoint(int flags)
{
...   /*    * An end-of-recovery checkpoint is really a shutdown checkpoint, just    * issued at a different time.
*/  if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))       shutdown = true;   else       shutdown =
false;
...
   LWLockAcquire(OidGenLock, LW_SHARED);   checkPoint.nextOid = ShmemVariableCache->nextOid;   if (!shutdown)
checkPoint.nextOid+= ShmemVariableCache->oidCount;   LWLockRelease(OidGenLock);
 
...   recptr = XLogInsert(RM_XLOG_ID,                       shutdown ? XLOG_CHECKPOINT_SHUTDOWN :
XLOG_CHECKPOINT_ONLINE);
...
}

I think that's to blame here.  Looking at the relevant WAL record shows:

pg_xlogdump -p /data/freund/jj -s 2/12004018 -e 2/1327EA28|grep -E 'CHECKPOINT|NEXTOID'
rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID
4302693
rmgr: XLOG        len (rec/tot):     80/   106, tx:          0, lsn: 2/12023C38, prev 2/12023C00, desc:
CHECKPOINT_ONLINEredo 2/12000120; /* ... */ oid 4294501; /* ... */ online
 
rmgr: XLOG        len (rec/tot):     80/   106, tx:          0, lsn: 2/1327A798, prev 2/1327A768, desc:
CHECKPOINT_SHUTDOWNredo 2/1327A798; /* ... */ oid 4294501; /* ... */ shutdown
 
rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID
4302693

(note that end-of-recovery checkpoints are logged as shutdown
checkpoints, pretty annoying imo)

So I think the issue is that the 2/12023C38 checkpoint *starts* before
the first NEXTOID, and thus gets an earlier nextoid.  The second -
shutdown/end-of-recovery - checkpoint then hits the above !shutdown and
doesn't add oidCount.  Thus after the crash we continue with a repeated
NEXOID.

There's this remark in xlog_redo():    /*     * We used to try to take the maximum of ShmemVariableCache->nextOid     *
andthe recorded nextOid, but that fails if the OID counter wraps     * around.  Since no OID allocation should be
happeningduring replay     * anyway, better to just believe the record exactly.  We still take     * OidGenLock while
settingthe variable, just in case.     */
 

I think that was perhaps not the best fix :(


I guess what we should do is to only use checkPoint.nextOid when
starting up from a checkpoint, and entirely rely on NEXTOID otherwise?

Regards,

Andres



Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

From
Andres Freund
Date:
On 2016-05-10 15:20:39 -0300, Alvaro Herrera wrote:
> Tom Lane wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> > > On Tue, May 10, 2016 at 12:19 PM, Andres Freund <andres@anarazel.de> wrote:
> > >> It's not super likely, yea. But you don't really need to "use" 4 billion
> > >> oids to get a wraparound. Once you have a significant number of values
> > >> in various toast tables, the oid counter progresses really rather fast,
> > >> without many writes. That's because the oid counter is global, but each
> > >> individual toast write (and other things), perform checks via
> > >> GetNewOidWithIndex().
> > 
> > > Understood.
> > 
> > Sooner or later we are going to need to go to 8-byte TOAST object
> > identifiers.  Maybe we should think about doing that sooner not later
> > rather than trying to invent some anti-wraparound solution here.
> 
> Umm, it seems to me like we need this fixed in supported branches, not
> just 9.7, so I don't think 8-byte toast IDs are a reasonable solution at
> this point.

Agreed. To me it seems we need to make HeapTupleSatisfiesToast() safe,
and other improvements are tangential.

Andres



Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

From
Andres Freund
Date:
On 2016-05-10 16:14:50 -0400, Tom Lane wrote:
> I think the problem is pretty hypothetical until you get to consuming a
> substantial part of the OID space within any one toast table, at which
> point you're going to need 8-byte toast OIDs.  Improving that situation
> seems like something we can define as a future feature.

Doesn't really have to be within a single toast table. As the oid
counter is global, it just needs to advance quickly enough globally.


> (Note that I do not believe for a moment that Andres has actually
> exhibited such a problem in his test case.  He'd have needed several
> TB worth of TOAST space to get to that point.)

That's good, because I explicitly said that there was no wraparound
involved in the test case.


> >> In principle, you could support existing TOAST tables and pointers
> >> containing 4-byte IDs in parallel with the new ones.  Not sure how
> >> pg_upgrade would handle it exactly though.
> 
> > I suppose the real problem is that there's no way to have a mix of 4-
> > and 8-byte identifiers in the same toast table.  I suppose we could have
> > two toast tables for each regular table, but that sounds pretty painful.
> 
> Hmm.  It's not impossible by any means, since you could tell by looking
> at a toast OID which table it was in.  And that way might provide a less
> painful upgrade process for an existing table that grew to the point of
> needing bigger OIDs.

The catalog representation (as in pg_class.reltoastrelid) isn't entirely
clear to me.  One way would be to invert pg_class.reltoastrelid's
meaning and have the toast table point to the table it stores values
for. That'd also open the potential of having one toast table per column
and such.

Greetings,

Andres Freund



Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

From
Andres Freund
Date:
On 2016-05-10 13:17:52 -0700, Jeff Janes wrote:
> On Tue, May 10, 2016 at 9:19 AM, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-05-10 08:09:02 -0400, Robert Haas wrote:
> >> On Tue, May 10, 2016 at 3:05 AM, Andres Freund <andres@anarazel.de> wrote:
> >> > The easy way to trigger this problem would be to have an oid wraparound
> >> > - but the WAL shows that that's not the case here.  I've not figured
> >> > that one out entirely (and won't tonight). But I do see WAL records
> >> > like:
> >> > rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID
4302693
> >> > rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID
4302693
> 
> Were there any CHECKPOINT_SHUTDOWN records, or any other NEXTOID
> records, between those two records you show?

Yes, check http://www.postgresql.org/message-id/20160510210013.2akn4iee7gl4ycen@alap3.anarazel.de

I think the explanation about how the bug is occuring there makes sense.


> My current test harness updates the scalar count field on every
> iteration, but changes the (probably toasted) text_array field with a
> probability of only 1% each time.  Perhaps making that more likely (by
> changing line 186 of count.pl) would make it easier to trigger the
> bug.  I'll try that in my next iteration of tests.

So my current theory about why the whole thing is kinda hard to
reproduce is that "luck" determines how aggressively the toast table is
vacuumed, and how often it actually succeeds in being vacuumed. You also
need a good bit of bad luck for the hint bits by GetNewOidWithIndex() to
not survive, given that shared_buffers is pretty small *and* checksums
are enabled.

I guess testing with a bigger shared memory and without checksums will
make it easier to hit the bug.

Regards,

Andres



Andres Freund <andres@anarazel.de> writes:
> On 2016-05-10 16:14:50 -0400, Tom Lane wrote:
>> I think the problem is pretty hypothetical until you get to consuming a
>> substantial part of the OID space within any one toast table, at which
>> point you're going to need 8-byte toast OIDs.  Improving that situation
>> seems like something we can define as a future feature.

> Doesn't really have to be within a single toast table. As the oid
> counter is global, it just needs to advance quickly enough globally.

I don't think that alters the conclusion.  To get rapid advancement
through the whole 4G OID space, you'd need densely packed existing
OIDs in all of the toast tables being inserted into.  And the more
toast tables in play, the less likely any one of them is densely
packed, because it's likely some OIDs were assigned into other
toast tables.

Now having said that, you don't actually need *rapid* advancement
of the OID counter to have a potential problem.  Imagine that some
transaction inserts a TOAST value and later fails, so that you have
a dead-but-unhinted toast tuple sitting there.  If no VACUUM comes
along to clean it up (quite possible if there's low activity on
the parent table), then after 4G uses of OIDs, you could be unlucky
enough that the next attempted insertion into that TOAST table chooses
the same OID.  Now, that choosing would be done by GetNewOidWithIndex
which would have to have seen the dead toast tuple; but if its hinting of
the tuple got lost somehow then you could have two tuples with the same
OID and neither hinted as dead.  Then subsequent TOAST fetches have
a problem that they might take the wrong one.

Having said that, I still say that changing HeapTupleSatisfiesToast
is the wrong thing.  It can't go deciding not to return toast values
because they're committed dead --- the parent tuple could easily be
committed dead as well, and yet still be visible to our query's
snapshot.  Nor is it attractive to introduce more hint-bit-setting
requirements for toast tuples; that just increases the amount of
hint-bit-setting disk traffic that must occur subsequently to
any update involving toast values.

Maybe the right fix here is for toast OID assignment to use
GetNewOidWithIndex with SnapshotToast, or some special-purpose
snapshot that does the right thing, so that it would see such a
dead toast row as being a conflict.  (If you got enough such
dead rows, this might make it hard to find a free OID; but hopefully
vacuum would come along and fix things before it got to that.)
        regards, tom lane



Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

From
Andres Freund
Date:
On 2016-05-10 18:29:59 -0400, Tom Lane wrote:
> Now having said that, you don't actually need *rapid* advancement
> of the OID counter to have a potential problem.  Imagine that some
> transaction inserts a TOAST value and later fails, so that you have
> a dead-but-unhinted toast tuple sitting there.  If no VACUUM comes
> along to clean it up (quite possible if there's low activity on
> the parent table)

Or if there's primarily inserts...


> Having said that, I still say that changing HeapTupleSatisfiesToast
> is the wrong thing.  It can't go deciding not to return toast values
> because they're committed dead --- the parent tuple could easily be
> committed dead as well, and yet still be visible to our query's
> snapshot.

Hm. Shouldn't a mvcc snapshot be able to differentiate between those
cases? When are we looking up toast tuple that's *not* visible to the
current snapshot?   I guess there could be some uglyness around
RETURNING for DELETE.   I guess we'd have to use logic akin to
SatisfiesVacuum, treating HEAPTUPLE_*_IN_PROGRESS/RECENTLY_DEAD as
alive.


> Nor is it attractive to introduce more hint-bit-setting
> requirements for toast tuples; that just increases the amount of
> hint-bit-setting disk traffic that must occur subsequently to
> any update involving toast values.

Yea, that's the big argument against going this way.


> Maybe the right fix here is for toast OID assignment to use
> GetNewOidWithIndex with SnapshotToast, or some special-purpose
> snapshot that does the right thing, so that it would see such a
> dead toast row as being a conflict.  (If you got enough such
> dead rows, this might make it hard to find a free OID; but hopefully
> vacuum would come along and fix things before it got to that.)

Seems like we'd have to either emit cleanup records, or treat even
tuples that are hinted as dead as a conflict (lest they get lost in a
crash or failover).

Greetings,

Andres Freund



On Tue, May 10, 2016 at 2:00 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-05-10 09:19:16 -0700, Andres Freund wrote:
>> On 2016-05-10 08:09:02 -0400, Robert Haas wrote:
>> > On Tue, May 10, 2016 at 3:05 AM, Andres Freund <andres@anarazel.de> wrote:
>> > > The easy way to trigger this problem would be to have an oid wraparound
>> > > - but the WAL shows that that's not the case here.  I've not figured
>> > > that one out entirely (and won't tonight). But I do see WAL records
>> > > like:
>> > > rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID
4302693
>> > > rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID
4302693
>> > > i.e. two NEXTOID records allocating the same range, which obviously
>> > > doesn't seem right.  There's also every now and then close by ranges:
>> > > rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 1/9A404DB8, prev 1/9A404270, desc: NEXTOID
3311455
>> > > rmgr: XLOG        len (rec/tot):      4/    30, tx:    7814505, lsn: 1/9A4EC888, prev 1/9A4EB9D0, desc: NEXTOID
3311461
>
>> > It seems to me that the real question
>> > here is how you're getting two calls to XLogPutNextOid() with the same
>> > value of ShmemVariableCache->nextOid, and the answer, as it seems to
>> > me, must be that LWLocks are broken.
>>
>> There likely were a bunch of crashes in between, Jeff's test suite
>> triggers them at a high rate. It seems a lot more likely than that an
>> lwlock bug only materializes in the oid counter.  Investigating.
>
> void
> CreateCheckPoint(int flags)
> {
> ...
>     /*
>      * An end-of-recovery checkpoint is really a shutdown checkpoint, just
>      * issued at a different time.
>      */
>     if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
>         shutdown = true;
>     else
>         shutdown = false;
> ...
>
>     LWLockAcquire(OidGenLock, LW_SHARED);
>     checkPoint.nextOid = ShmemVariableCache->nextOid;
>     if (!shutdown)
>         checkPoint.nextOid += ShmemVariableCache->oidCount;
>     LWLockRelease(OidGenLock);
> ...
>     recptr = XLogInsert(RM_XLOG_ID,
>                         shutdown ? XLOG_CHECKPOINT_SHUTDOWN :
>                         XLOG_CHECKPOINT_ONLINE);
> ...
> }
>
> I think that's to blame here.  Looking at the relevant WAL record shows:
>
> pg_xlogdump -p /data/freund/jj -s 2/12004018 -e 2/1327EA28|grep -E 'CHECKPOINT|NEXTOID'
> rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID
4302693
> rmgr: XLOG        len (rec/tot):     80/   106, tx:          0, lsn: 2/12023C38, prev 2/12023C00, desc:
CHECKPOINT_ONLINEredo 2/12000120; /* ... */ oid 4294501; /* ... */ online
 

By my understanding, this is the point at which the crash occurred.

> rmgr: XLOG        len (rec/tot):     80/   106, tx:          0, lsn: 2/1327A798, prev 2/1327A768, desc:
CHECKPOINT_SHUTDOWNredo 2/1327A798; /* ... */ oid 4294501; /* ... */ shutdown
 
> rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID
4302693
>
> (note that end-of-recovery checkpoints are logged as shutdown
> checkpoints, pretty annoying imo)
>
> So I think the issue is that the 2/12023C38 checkpoint *starts* before
> the first NEXTOID, and thus gets an earlier nextoid.


But isn't CreateCheckPoint called at the end of the checkpoint, not
the start of it?

I don't understand how it could be out-of-date at that point.  But
obviously it is.


Cheers,

Jeff



Andres Freund <andres@anarazel.de> writes:
> On 2016-05-10 18:29:59 -0400, Tom Lane wrote:
>> Having said that, I still say that changing HeapTupleSatisfiesToast
>> is the wrong thing.  It can't go deciding not to return toast values
>> because they're committed dead --- the parent tuple could easily be
>> committed dead as well, and yet still be visible to our query's
>> snapshot.

> Hm. Shouldn't a mvcc snapshot be able to differentiate between those
> cases?

HeapTupleSatisfiesToast doesn't have one.  And changing things so that
toast tuples are checked using MVCC rules is the wrong thing anyway,
because it would require adding hint-bit update traffic for toast
tables.

> When are we looking up toast tuple that's *not* visible to the
> current snapshot?

Once again, it's the parent tuple where we should be doing the
visibility check; noplace else.
        regards, tom lane



Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

From
Andres Freund
Date:
On 2016-05-10 15:53:38 -0700, Jeff Janes wrote:
> On Tue, May 10, 2016 at 2:00 PM, Andres Freund <andres@anarazel.de> wrote:
> > I think that's to blame here.  Looking at the relevant WAL record shows:
> >
> > pg_xlogdump -p /data/freund/jj -s 2/12004018 -e 2/1327EA28|grep -E 'CHECKPOINT|NEXTOID'
> > rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID
4302693
> > rmgr: XLOG        len (rec/tot):     80/   106, tx:          0, lsn: 2/12023C38, prev 2/12023C00, desc:
CHECKPOINT_ONLINEredo 2/12000120; /* ... */ oid 4294501; /* ... */ online
 
>
> By my understanding, this is the point at which the crash occurred.

Right.

> > rmgr: XLOG        len (rec/tot):     80/   106, tx:          0, lsn: 2/1327A798, prev 2/1327A768, desc:
CHECKPOINT_SHUTDOWNredo 2/1327A798; /* ... */ oid 4294501; /* ... */ shutdown
 
> > rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID
4302693
> >
> > (note that end-of-recovery checkpoints are logged as shutdown
> > checkpoints, pretty annoying imo)
> >
> > So I think the issue is that the 2/12023C38 checkpoint *starts* before
> > the first NEXTOID, and thus gets an earlier nextoid.
>
>
> But isn't CreateCheckPoint called at the end of the checkpoint, not
> the start of it?

No, CreateCheckPoint() does it all.


CreateCheckPoint(int flags)
{
...   /* 1) determine redo pointer */   WALInsertLockAcquireExclusive();   curInsert =
XLogBytePosToRecPtr(Insert->CurrBytePos);  prevPtr = XLogBytePosToRecPtr(Insert->PrevBytePos);
WALInsertLockRelease();
...   /* 2) determine oid */   LWLockAcquire(OidGenLock, LW_SHARED);   checkPoint.nextOid =
ShmemVariableCache->nextOid;  if (!shutdown)       checkPoint.nextOid += ShmemVariableCache->oidCount;
LWLockRelease(OidGenLock);
...   /* 3) actually checkpoint shared_buffers et al. */   CheckPointGuts(checkPoint.redo, flags);
...   /* 4) log the checkpoint */   recptr = XLogInsert(RM_XLOG_ID,                       shutdown ?
XLOG_CHECKPOINT_SHUTDOWN:                       XLOG_CHECKPOINT_ONLINE);
 
...
}


> I don't understand how it could be out-of-date at that point.  But
> obviously it is.

A checkpoint logically "starts" at 1) in the above abbreviated
CreateCheckPoint(), that's where recovery starts when starting up from
that checkpoint. But inbetween 1) and 4) all other backends can continue
to insert WAL, and it'll be replayed *before* the checkpoint's record
itself.  That means that if some backend generates a NEXTOID record
between 2) and 4) (with largers checkpoints we're looking at minutes to
an hour of time), it's effects will temporarily take effect (as in
ShmemVariableCache->nextOid is updated), but XLOG_CHECKPOINT_ONLINE's
replay will overwrite it unconditionally:
void
xlog_redo(XLogReaderState *record)
{   else if (info == XLOG_CHECKPOINT_ONLINE)   {
...       /* ... but still treat OID counter as exact */       LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
ShmemVariableCache->nextOid= checkPoint.nextOid;       ShmemVariableCache->oidCount = 0;
LWLockRelease(OidGenLock);

Makes sense?

Regards,

Andres



On Tue, May 10, 2016 at 4:02 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-05-10 15:53:38 -0700, Jeff Janes wrote:
>>
>> But isn't CreateCheckPoint called at the end of the checkpoint, not
>> the start of it?
>
> No, CreateCheckPoint() does it all.
>
>
> CreateCheckPoint(int flags)
> {
> ...
>     /* 1) determine redo pointer */
>     WALInsertLockAcquireExclusive();
>     curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
>     prevPtr = XLogBytePosToRecPtr(Insert->PrevBytePos);
>     WALInsertLockRelease();
> ...
>     /* 2) determine oid */
>     LWLockAcquire(OidGenLock, LW_SHARED);
>     checkPoint.nextOid = ShmemVariableCache->nextOid;
>     if (!shutdown)
>         checkPoint.nextOid += ShmemVariableCache->oidCount;
>     LWLockRelease(OidGenLock);
> ...
>     /* 3) actually checkpoint shared_buffers et al. */
>     CheckPointGuts(checkPoint.redo, flags);
> ...
>     /* 4) log the checkpoint */
>     recptr = XLogInsert(RM_XLOG_ID,
>                         shutdown ? XLOG_CHECKPOINT_SHUTDOWN :
>                         XLOG_CHECKPOINT_ONLINE);
> ...
> }

OK, got it.  I don' t know how I missed the bigger picture of that
function in the first place.

But, another perhaps stupid question, why do we care what the value of
nextOid was at the start of the last successfully completed
checkpoint?  Wouldn't it make more sense to populate that part of the
record at the end, not the start, of the checkpoint?


>
>
>> I don't understand how it could be out-of-date at that point.  But
>> obviously it is.
>
> A checkpoint logically "starts" at 1) in the above abbreviated
> CreateCheckPoint(), that's where recovery starts when starting up from
> that checkpoint. But inbetween 1) and 4) all other backends can continue
> to insert WAL, and it'll be replayed *before* the checkpoint's record
> itself.  That means that if some backend generates a NEXTOID record
> between 2) and 4) (with largers checkpoints we're looking at minutes to
> an hour of time), it's effects will temporarily take effect (as in
> ShmemVariableCache->nextOid is updated), but XLOG_CHECKPOINT_ONLINE's
> replay will overwrite it unconditionally:
> void
> xlog_redo(XLogReaderState *record)
> {
>     else if (info == XLOG_CHECKPOINT_ONLINE)
>     {
> ...
>         /* ... but still treat OID counter as exact */
>         LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
>         ShmemVariableCache->nextOid = checkPoint.nextOid;
>         ShmemVariableCache->oidCount = 0;
>         LWLockRelease(OidGenLock);
>
> Makes sense?

So it seems like we should unconditionally **not** do that when
replaying XLOG_CHECKPOINT_ONLINE, right?

If this is the checkpoint record at which we started the system then
we would have done that "replay" into ShmemVariableCache->nextOid
during StartupXLOG(), there is no reason to do it again when redo
passes through that record the 2nd time.

But, to round this out, all of this is formally only a hint on where
to start OIDs so as to avoid performance problems as you busy-loop
looking for an unused OID.  The real correctness bug is in the
hint-bit problem you discuss elsewhere that turns collisions into
corruption, and this bug just makes that other one reachable?

Cheers,

Jeff



Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

From
Andres Freund
Date:
On 2016-05-10 17:36:06 -0700, Jeff Janes wrote:
> OK, got it.  I don' t know how I missed the bigger picture of that
> function in the first place.
> 
> But, another perhaps stupid question, why do we care what the value of
> nextOid was at the start of the last successfully completed
> checkpoint?  Wouldn't it make more sense to populate that part of the
> record at the end, not the start, of the checkpoint?

I don't think that really makes a large difference. Unless we'd hold
OidGenLock across the XLogInsert() - which we don't want to do for
performance / deadlock reasons - we'd have a similar race.   I think
it's simply important to recognize (and better document!) that the
values in the checkpoint record are from roughly the time of the REDO
pointer and not from the end of the checkpoint.


> >> I don't understand how it could be out-of-date at that point.  But
> >> obviously it is.
> >
> > A checkpoint logically "starts" at 1) in the above abbreviated
> > CreateCheckPoint(), that's where recovery starts when starting up from
> > that checkpoint. But inbetween 1) and 4) all other backends can continue
> > to insert WAL, and it'll be replayed *before* the checkpoint's record
> > itself.  That means that if some backend generates a NEXTOID record
> > between 2) and 4) (with largers checkpoints we're looking at minutes to
> > an hour of time), it's effects will temporarily take effect (as in
> > ShmemVariableCache->nextOid is updated), but XLOG_CHECKPOINT_ONLINE's
> > replay will overwrite it unconditionally:
> > void
> > xlog_redo(XLogReaderState *record)
> > {
> >     else if (info == XLOG_CHECKPOINT_ONLINE)
> >     {
> > ...
> >         /* ... but still treat OID counter as exact */
> >         LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
> >         ShmemVariableCache->nextOid = checkPoint.nextOid;
> >         ShmemVariableCache->oidCount = 0;
> >         LWLockRelease(OidGenLock);
> >
> > Makes sense?
> 
> So it seems like we should unconditionally **not** do that when
> replaying XLOG_CHECKPOINT_ONLINE, right?

I think we should actually just not do it during replay *at all*. It
seems entirely sufficient to use ->nextOid in StartupXLOG when it has
found the checkpoint to start up from:/* initialize shared memory variables from the checkpoint record
*/ShmemVariableCache->nextXid= checkPoint.nextXid;ShmemVariableCache->nextOid =
checkPoint.nextOid;ShmemVariableCache->oidCount= 0;
 


FWIW, I've commented out the relevant sections from xlog_redo and since
then I've not been able to reproduce the issue.


> But, to round this out, all of this is formally only a hint on where
> to start OIDs so as to avoid performance problems as you busy-loop
> looking for an unused OID.  The real correctness bug is in the
> hint-bit problem you discuss elsewhere that turns collisions into
> corruption, and this bug just makes that other one reachable?

That's my opinion at least.

Greetings,

Andres Freund



Jeff Janes <jeff.janes@gmail.com> writes:
> But, another perhaps stupid question, why do we care what the value of
> nextOid was at the start of the last successfully completed
> checkpoint?

The intended use of that field is to restore nextOid before replaying
WAL.  So it should correspond to the value at checkpoint start.  I think
what the code is doing now during replay is probably just brain fade :-(

> Wouldn't it make more sense to populate that part of the
> record at the end, not the start, of the checkpoint?

The idea is that we should initialize the OID counter to something
approximately right before we start replay, so that replay of OID-
counter-advance records behaves sensibly.

> But, to round this out, all of this is formally only a hint on where
> to start OIDs so as to avoid performance problems as you busy-loop
> looking for an unused OID.  The real correctness bug is in the
> hint-bit problem you discuss elsewhere that turns collisions into
> corruption, and this bug just makes that other one reachable?

Right, the theory is that being slightly off shouldn't matter.
If that's wrong, we would have race-condition problems with either
intended timing of sampling the OID counter.
        regards, tom lane



Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

From
Teodor Sigaev
Date:
>>      Allow Pin/UnpinBuffer to operate in a lockfree manner.
>> I get the errors:
>>
>> ERROR:  attempted to delete invisible tuple
>> ERROR:  unexpected chunk number 1 (expected 2) for toast value

Just reminder, you investigate "unexpected chunk number" problem, but, seems, we 
have another bug (first ERROR: attempted to delete invisible tuple). IMHO, it's 
a separate bug, not related to oid. Unfortunately, I've never seen such error on 
my notebook.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 



Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

From
Andres Freund
Date:

On May 11, 2016 7:06:05 AM PDT, Teodor Sigaev <teodor@sigaev.ru> wrote:
>>>      Allow Pin/UnpinBuffer to operate in a lockfree manner.
>>> I get the errors:
>>>
>>> ERROR:  attempted to delete invisible tuple
>>> ERROR:  unexpected chunk number 1 (expected 2) for toast value
>
>Just reminder, you investigate "unexpected chunk number" problem, but,
>seems, we 
>have another bug (first ERROR: attempted to delete invisible tuple).
>IMHO, it's 
>a separate bug, not related to oid. Unfortunately, I've never seen such
>error on 
>my notebook.

Same issue. If the dead tuple is noticed by heap_delete (when it should have deleted the live version elsewhere) you
getthe invisible role error.  The unexpected chunk bit only triggers with hint bit sets differently across chunks.
 

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

From
Andres Freund
Date:
On 2016-05-10 17:58:33 -0700, Andres Freund wrote:
> FWIW, I've commented out the relevant sections from xlog_redo and since
> then I've not been able to reproduce the issue.

A couple days of running later, and it hasn't yet re-triggered. So I
assume that's indeed the fix.



On Mon, May 16, 2016 at 8:28 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-05-10 17:58:33 -0700, Andres Freund wrote:
>> FWIW, I've commented out the relevant sections from xlog_redo and since
>> then I've not been able to reproduce the issue.
>
> A couple days of running later, and it hasn't yet re-triggered. So I
> assume that's indeed the fix.

Yes, I've also run it to wrap-around a few times with that commented
out, so I agree that that is a fix. But I don't think we should apply
until we get the larger issue with hit-bits is fixed first, as then we
are masking one problem by fixing another.

Although it doesn't really matter to me, because I'm not using HEAD to
run the tests anyway, as I can't get the error to trigger on HEAD.  As
a side note, using 48354581a49c30f5 and without commenting out the
part of the replay code we are talking about, I also need
track_io_timing to be on in order to get the error.  I don't think
there is any direct causation there, just another fiddly timing issue.

I've tried fixing the hit-bit issue as well, but I'm afraid it is over
my head.  I'll be happy to test proposed patches, though.

Cheers,

Jeff



On 5/10/16 4:12 PM, Andres Freund wrote:
> The catalog representation (as in pg_class.reltoastrelid) isn't entirely
> clear to me.  One way would be to invert pg_class.reltoastrelid's
> meaning and have the toast table point to the table it stores values
> for. That'd also open the potential of having one toast table per column
> and such.

FWIW, toast-per-column is something I have a use case for.

Can we also consider using a per-toast-table sequence instead of OID? 
IIRC the generation mechanics of the two are similar, and that would 
greatly reduce the pressure on OID generation.

Tom, were you around when sequences were added? I'm guessing that that 
was done in response to OIDs becoming a serious problem in user tables 
on larger installs; ISTM this is just the next iteration of them being a 
problem. (And I suspect the one after this will be pg_attribute or maybe 
pg_depend).
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461



On Tue, May 10, 2016 at 02:06:19PM -0400, Tom Lane wrote:
> Sooner or later we are going to need to go to 8-byte TOAST object
> identifiers.  Maybe we should think about doing that sooner not
> later rather than trying to invent some anti-wraparound solution
> here.

Yay!  Is there any lift in separating TOAST OIDs from the rest?

> In principle, you could support existing TOAST tables and pointers
> containing 4-byte IDs in parallel with the new ones.

> Not sure how pg_upgrade would handle it exactly though.

This is yet another reason we should get away from in-place binary
upgrade as a strategy.  It's always been fragile, and it's only ever
been justifiable on grounds of hardware economics that no longer
obtain.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate