Thread: Index corruption

Index corruption

From
Marc Munro
Date:
We have now experienced index corruption on two separate but identical
slony clusters.  In each case the slony subscriber failed after
attempting to insert a duplicate record.  In each case reindexing the
sl_log_1 table on the provider fixed the problem.
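Concretely, the reindexing was along these lines, run on the provider
("_cluster" here stands in for our actual Slony cluster schema name):

    REINDEX TABLE "_cluster".sl_log_1;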

The latest occurrence was on our production cluster yesterday.  This has
only happened since we performed kernel upgrades and we are uncertain
whether this represents a kernel bug, or a postgres bug exposed by
different timings in the new kernel.

Our systems are:

Sun v40z 4 x Dual Core AMD Opteron(tm) Processor 875
Kernel 2.6.16.14 #8 SMP x86_64 x86_64 x86_64 GNU/Linux
kernel boot option: elevator=deadline
16 Gigs of RAM
postgresql-8.0.3-1PGDG
Bonded e1000/tg3 NICs with 8192 MTU.
Slony 1.1.0

NetApp FAS270 OnTap 7.0.3
Mounted with the NFS options
rw,nfsvers=3,hard,rsize=32768,wsize=32768,timeo=600,tcp,noac
Jumbo frames 8192 MTU.

All postgres data and logs are stored on the netapp.

In the latest episode, the index corruption was coincident with a
slony-induced vacuum.  I don't know if this was the case with our test
system failures.

What can we do to help identify the cause of this?  I believe we will be
able to reproduce this on a test system if there is some useful
investigation we can perform.

__
Marc

Re: Index corruption

From
Tom Lane
Date:
Marc Munro <marc@bloodnok.com> writes:
> We have now experienced index corruption on two separate but identical
> slony clusters.  In each case the slony subscriber failed after
> attempting to insert a duplicate record.  In each case reindexing the
> sl_log_1 table on the provider fixed the problem.

Please be more specific about what you mean by "index corruption"
... what were the exact symptoms?

> postgresql-8.0.3-1PGDG

The *first* thing you should do is update to 8.0.8 and see if the
problem is still there.
        regards, tom lane


Re: Index corruption

From
Marc Munro
Date:
On Tom Lane's advice, we upgraded to Postgres 8.0.8.  We also upgraded
slony to 1.1.5, due to some rpm issues.  Apart from that everything is
as described below.

We were able to corrupt the index within 90 minutes of starting up our
cluster.  A slony-induced vacuum was under way on the provider at the
time the subscriber failed.

What can we do to help identify the cause of this?  We have a test
system that seems able to reproduce this fairly easily.

__
Marc

On Wed, 2006-06-28 at 09:28 -0700, Marc Munro wrote:
> We have now experienced index corruption on two separate but identical
> slony clusters.  In each case the slony subscriber failed after
> attempting to insert a duplicate record.  In each case reindexing the
> sl_log_1 table on the provider fixed the problem.
>
> The latest occurrence was on our production cluster yesterday.  This has
> only happened since we performed kernel upgrades and we are uncertain
> whether this represents a kernel bug, or a postgres bug exposed by
> different timings in the new kernel.
>
> Our systems are:
>
> Sun v40z 4 x Dual Core AMD Opteron(tm) Processor 875
> Kernel 2.6.16.14 #8 SMP x86_64 x86_64 x86_64 GNU/Linux
> kernel boot option: elevator=deadline
> 16 Gigs of RAM
> postgresql-8.0.3-1PGDG
> Bonded e1000/tg3 NICs with 8192 MTU.
> Slony 1.1.0
>
> NetApp FAS270 OnTap 7.0.3
> Mounted with the NFS options
> rw,nfsvers=3,hard,rsize=32768,wsize=32768,timeo=600,tcp,noac
> Jumbo frames 8192 MTU.
>
> All postgres data and logs are stored on the netapp.
>
> In the latest episode, the index corruption was coincident with a
> slony-induced vacuum.  I don't know if this was the case with our test
> system failures.
>

> __
> Marc

Re: Index corruption

From
Tom Lane
Date:
Marc Munro <marc@bloodnok.com> writes:
> On Tom Lane's advice, we upgraded to Postgres 8.0.8.

OK, so it's not an already-known problem.

> We were able to corrupt the index within 90 minutes of starting up our
> cluster.  A slony-induced vacuum was under way on the provider at the
> time the subscriber failed.

You still haven't given any details.  What do you mean by "corrupt the
index" --- what actual symptoms are you seeing?
        regards, tom lane


Re: Index corruption

From
Marc Munro
Date:
On Thu, 2006-06-29 at 12:11 -0400, Tom Lane wrote:
> OK, so it's not an already-known problem.
>
> > We were able to corrupt the index within 90 minutes of starting up our
> > cluster.  A slony-induced vacuum was under way on the provider at the
> > time the subscriber failed.
>
> You still haven't given any details.  What do you mean by "corrupt the
> index" --- what actual symptoms are you seeing?

We have a two-node slony cluster with load tests running against the
provider node.

After resyncing the subscriber, and running load tests for about an
hour, the slony subscriber begins to fail.  From the log file:

2006-06-28 17:58:43 PDT ERROR  remoteWorkerThread_1: "insert into "public"."table_trans_attribute"
(table_transaction_id,attribute_type,value)values ('374520943','22001','0'); 
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values
('374520943','22002','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values
('374520943','22003','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values
('374520943','22004','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values
('374520943','22005','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values
('374520943','22006','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values
('374520943','22007','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values
('374520943','22007','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values
('374520942','22009','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values
('374520942','22010','0');
" ERROR:  duplicate key violates unique constraint "table_trans_attr_pk"

As you see, slony is attempting to enter one tuple
('374520943','22007','0') two times.

Each previous time we have had this problem, rebuilding the indexes on
slony log table (sl_log_1) has fixed the problem.  I have not reindexed
the table this time as I do not want to destroy any usable evidence.

__
Marc

Re: Index corruption

From
Tom Lane
Date:
Marc Munro <marc@bloodnok.com> writes:
> As you see, slony is attempting to enter one tuple
> ('374520943','22007','0') two times.
> Each previous time we have had this problem, rebuilding the indexes on
> slony log table (sl_log_1) has fixed the problem.  I have not reindexed
> the table this time as I do not want to destroy any usable evidence.

Good.  How big is the log table --- would it be possible for you to send
me the physical table and index files?  (Not a pg_dump, the actual disk
files)
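
If it helps in locating them, here is an untested sketch of the sort of
query that identifies the underlying disk files (they live under
$PGDATA/base/<database oid>/<relfilenode>; '_cluster' is a placeholder for
the actual Slony cluster schema name):

    SELECT oid FROM pg_database WHERE datname = current_database();
    SELECT c.relname, c.relfilenode
      FROM pg_class c, pg_namespace n
     WHERE c.relnamespace = n.oid
       AND n.nspname = '_cluster'
       AND c.relname IN ('sl_log_1', 'sl_log_1_idx1', 'sl_log_1_idx2');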
        regards, tom lane


Re: Index corruption

From
Marc Munro
Date:
We have reproduced the problem again.  This time it looks like vacuum is
not part of the picture.  From the provider's log:

2006-06-29 14:02:41 CST DEBUG2 cleanupThread:  101.057 seconds for vacuuming

And from the subscriber's:

2006-06-29 13:00:43 PDT ERROR  remoteWorkerThread_1: "insert into "public"."table_trans_attribute"
(table_transaction_id,attribute_type,value)values ('374740387','22008','4000'); 

If my maths is correct and the logs are honest, the vacuum would have
started at 14:01:00 CST (13:01 PDT), about 20 seconds after we first
encountered the problem.  The clocks on the two machines, though in
different timezones, are currently synced.

Tom, I will create another tarball of the sl_log_1 table and indexes.
Should be quite a bit smaller than the previous one.

__
Marc

Re: Index corruption

From
Chris Browne
Date:
marc@bloodnok.com (Marc Munro) writes:
> As you see, slony is attempting to enter one tuple
> ('374520943','22007','0') two times.
>
> Each previous time we have had this problem, rebuilding the indexes on
> slony log table (sl_log_1) has fixed the problem.  I have not reindexed
> the table this time as I do not want to destroy any usable evidence.

We have seen this phenomenon on 7.4.8 several times; pulled dumps of
sl_log_1 and index files that Jan Wieck looked at, which alas hasn't
led to a fix.

He did, mind you, find some concurrency pattern that led, if memory
serves, to 7.4.12's release.  We had experienced cases where there was
some worse corruption that required that we rebuild replicas from
scratch :-(.
-- 
output = reverse("gro.mca" "@" "enworbbc")
http://cbbrowne.com/info/advocacy.html
"There  is no  psychiatrist in  the world  like a  puppy  licking your
face."  -- Ben Williams


Re: Index corruption

From
Tom Lane
Date:
Marc Munro <marc@bloodnok.com> writes:
> Tom,
> we have a newer and much smaller (35M) file showing the same thing:

Thanks.  Looking into this, what I find is that *both* indexes have
duplicated entries for the same heap tuple:

idx1:
Item 190 -- Length:   24  Offset: 3616 (0x0e20)  Flags: USED
  Block Id: 27806  linp Index: 2  Size: 24
  Has Nulls: 0  Has Varwidths: 0

  0e20: 00009e6c 02001800 01000000 0057d123  ...l.........W.#
  0e30: 1a781200 00000000                    .x......

Item 191 -- Length:   24  Offset: 3592 (0x0e08)  Flags: USED
  Block Id: 27806  linp Index: 2  Size: 24
  Has Nulls: 0  Has Varwidths: 0

  0e08: 00009e6c 02001800 01000000 0057d123  ...l.........W.#
  0e18: 2e781200 00000000                    .x......


idx2:
Item 127 -- Length:   16  Offset: 6144 (0x1800)  Flags: USED
  Block Id: 27806  linp Index: 2  Size: 16
  Has Nulls: 0  Has Varwidths: 0

  1800: 00009e6c 02001000 0057d123 00000000  ...l.....W.#....

Item 128 -- Length:   16  Offset: 6128 (0x17f0)  Flags: USED
  Block Id: 27806  linp Index: 2  Size: 16
  Has Nulls: 0  Has Varwidths: 0

  17f0: 00009e6c 02001000 0057d123 00000000  ...l.....W.#....

However, the two entries in idx1 contain different data!!

What I speculate right at the moment is that we are not looking at index
corruption at all, but at heap corruption: somehow, the first insertion
into ctid (27806,2) got lost and the same ctid got re-used for the next
inserted row.  We fixed one bug like this before ...
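
If the table is still in that state, a quick sanity check would be to fetch
whatever currently occupies that line pointer with a TID lookup (rough
sketch; "_cluster" is a placeholder for the Slony cluster schema):

    SELECT ctid, xmin, cmin, log_xid, log_actionseq
      FROM "_cluster".sl_log_1
     WHERE ctid = '(27806,2)';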
        regards, tom lane


Re: Index corruption

From
Hannu Krosing
Date:
On Thu, 2006-06-29 at 16:42, Chris Browne wrote:
> marc@bloodnok.com (Marc Munro) writes:
> > As you see, slony is attempting to enter one tuple
> > ('374520943','22007','0') two times.
> >
> > Each previous time we have had this problem, rebuilding the indexes on
> > slony log table (sl_log_1) has fixed the problem.  I have not reindexed
> > the table this time as I do not want to destroy any usable evidence.
> 
> We have seen this phenomenon on 7.4.8 several times; pulled dumps of
> sl_log_1 and index files that Jan Wieck looked at, which alas hasn't
> led to a fix.
> 
> He did, mind you, find some concurrency pattern that led, if memory
> serves, to 7.4.12's release.  We had experienced cases where there was
> some worse corruption that required that we rebuild replicas from
> scratch :-(.

How well have you checked the C-language triggers and special slony
functions for possibly corrupting some backend/shared-memory structures?

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



Re: Index corruption

From
Hannu Krosing
Date:
On Thu, 2006-06-29 at 17:23, Tom Lane wrote:
> Marc Munro <marc@bloodnok.com> writes:
> > Tom,
> > we have a newer and much smaller (35M) file showing the same thing:
> 
> Thanks.  Looking into this, what I find is that *both* indexes have
> duplicated entries for the same heap tuple:
> 
...
> However, the two entries in idx1 contain different data!!
> 
> What I speculate right at the moment is that we are not looking at index
> corruption at all, but at heap corruption: somehow, the first insertion
> into ctid (27806,2) got lost and the same ctid got re-used for the next
> inserted row.  We fixed one bug like this before ...

Marc: do you have triggers on some replicated tables ?

I remember having some corruption in a database with weird circular
trigger structures, some of them being slony log triggers. 

The thing that seemed to mess something up in there was when a change on a
parent row fired a trigger that changed child table rows, and those rows
fired another trigger that changed the same parent row again.

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com




Re: Index corruption

From
Marc Munro
Date:
On Fri, 2006-06-30 at 00:37 +0300, Hannu Krosing wrote:
> Marc: do you have triggers on some replicated tables ?
>
We have a non-slony trigger on only 2 tables, neither of them involved
in this transaction.  We certainly have no circular trigger structures.

> I remember having some corruption in a database with weird circular
> trigger structures, some of them being slony log triggers.
>
> The thing that seemed to mess something up in there was when a change on a
> parent row fired a trigger that changed child table rows, and those rows
> fired another trigger that changed the same parent row again.
>

__
Marc

Re: Index corruption

From
Tom Lane
Date:
I wrote:
> What I speculate right at the moment is that we are not looking at index
> corruption at all, but at heap corruption: somehow, the first insertion
> into ctid (27806,2) got lost and the same ctid got re-used for the next
> inserted row.  We fixed one bug like this before ...

Further study makes this look more probable.  It seems that most of the
activity around the trouble spot consists of transactions inserting
exactly 13 rows into the table; for instance here are traces of two
successive transactions:

Block    Item    XMIN        CMIN

27800    7    600921860    3
27800    8    600921860    7
27800    9    600921860    11
27800    10    600921860    15
27800    11    600921860    19
27800    12    600921860    23
27800    13    600921860    27
27800    14    600921860    31
27800    15    600921860    35
27800    16    600921860    39
27800    17    600921860    43
27800    18    600921860    47
27800    19    600921860    51

27800    20    600921870    3
27800    21    600921870    7
27800    22    600921870    11
27800    23    600921870    15
27800    24    600921870    19
27800    25    600921870    23
27800    26    600921870    27
27800    27    600921870    31
27800    28    600921870    35
27800    29    600921870    39
27800    30    600921870    43
27800    31    600921870    47
27800    32    600921870    51

The pattern of CMIN values is the same for all these transactions.
But look at the rows inserted by 600921856 and 600921858, the two
transactions that seem to be involved with the problem on page
27806:

27787    50    600921856    3
27795    41    600921858    3
27795    42    600921858    7
27795    43    600921858    11
27795    44    600921858    15
27795    45    600921858    19
27795    46    600921858    23
27795    47    600921858    27
27795    48    600921858    31
27795    49    600921858    35
27795    50    600921858    39
27795    51    600921858    43
27806    1    600921858    47
27806    2    600921856    15
27806    3    600921856    19
27806    4    600921856    23
27806    5    600921856    27
27806    6    600921856    31
27806    7    600921856    35
27806    8    600921856    39
27806    9    600921856    43
27806    10    600921856    47
27806    11    600921856    51
27806    12    600921858    51

(27787,50) and (27795,51) are both the last rows on their pages.  What's
evidently happened is that the two transactions filled those pages and
then both seized on 27806 as the next page to insert into.  I think that
600921856 tried to insert its CMIN 7 and 11 rows as items 1 and 2 on
that page, and then something wiped the page, then 600921858 inserted
its CMIN 47 row as item 1, and then 600921856 got control back and
finished inserting its rows.  Further study of the indexes shows that
there are two entries in each index pointing to each of (27806,1) and
(27806,2) --- but since the xxid values are different for the two
(27806,1) entries, those didn't show up as duplicates in my first look.
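
For the record, the surviving rows of those two transactions can also be
listed straight from the live table, should anyone want to cross-check the
dump.  A rough sketch, again with "_cluster" as a placeholder for the
cluster schema (this only shows rows that are still visible and not yet
cleaned up):

    SELECT ctid, xmin, cmin, log_actionseq
      FROM "_cluster".sl_log_1
     WHERE xmin = 600921856 OR xmin = 600921858
     ORDER BY log_actionseq;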

Given the apparent connection to vacuuming, this is looking a WHOLE
lot like this bug fixed in 8.0.3:

2005-05-07 17:32  tgl
	* src/backend/: access/heap/hio.c, commands/vacuumlazy.c
	(REL7_3_STABLE), access/heap/hio.c, access/nbtree/nbtpage.c,
	access/nbtree/nbtree.c, commands/vacuumlazy.c (REL7_4_STABLE),
	access/heap/hio.c, commands/vacuumlazy.c (REL7_2_STABLE),
	access/heap/hio.c, access/nbtree/nbtpage.c, access/nbtree/nbtree.c,
	commands/vacuumlazy.c (REL8_0_STABLE), access/heap/hio.c,
	access/nbtree/nbtpage.c, access/nbtree/nbtree.c,
	commands/vacuumlazy.c:

	Repair very-low-probability race condition between relation extension
	and VACUUM: in the interval between adding a new page to the relation
	and formatting it, it was possible for VACUUM to come along and decide
	it should format the page too.  Though not harmful in itself, this
	would cause data loss if a third transaction were able to insert
	tuples into the vacuumed page before the original extender got
	control back.
 

Are you *certain* this slave isn't running 8.0.2 or older?  If you can
verify that, then I guess we need to look for another mechanism that
could cause the same kind of thing.
        regards, tom lane


Re: Index corruption

From
Tom Lane
Date:
Marc Munro <marc@bloodnok.com> writes:
> If there's anything we can do to help debug this we will.  We can apply
> patches, different build options, etc.

One easy thing that would be worth trying is to build with
--enable-cassert and see if any Asserts get provoked during the
failure case.  I don't have a lot of hope for that, but it's
something that would require only machine time not people time.

A couple other things to try, given that you can provoke the failure
fairly easily:

1. In studying the code, it bothers me a bit that P_NEW is the same as
InvalidBlockNumber.  The intended uses of P_NEW appear to be adequately
interlocked, but it's fairly easy to see how something like this could
happen if there are any places where InvalidBlockNumber is
unintentionally passed to ReadBuffer --- that would look like a P_NEW
call and it *wouldn't* be interlocked.  So it would be worth changing
P_NEW to "(-2)" (this should just take a change in bufmgr.h and
recompile) and adding an "Assert(blockNum != InvalidBlockNumber)"
at the head of ReadBufferInternal().  Then rebuild with asserts enabled
and see if the failure case provokes that assert.

2. I'm also eyeing this bit of code in hio.c:
       /*
        * If the FSM knows nothing of the rel, try the last page before
        * we give up and extend.  This avoids one-tuple-per-page syndrome
        * during bootstrapping or in a recently-started system.
        */
       if (targetBlock == InvalidBlockNumber)
       {
           BlockNumber nblocks = RelationGetNumberOfBlocks(relation);

           if (nblocks > 0)
               targetBlock = nblocks - 1;
       }

If someone else has just extended the relation, it's possible that this
will allow a process to get to the page before the intended extender has
finished initializing it.  AFAICT that's not harmful because the page
will look like it has no free space ... but it seems a bit fragile.
If you dike out the above-mentioned code, can you still provoke the
failure?

A different line of attack is to see if you can make a self-contained
test case so other people can try to reproduce it.  More eyeballs on the
problem are always better.

Lastly, it might be interesting to look at the WAL logs for the period
leading up to a failure.  This would give us an idea of what was
happening concurrently with the processes that seem directly involved.
        regards, tom lane


Re: Index corruption

From
Tom Lane
Date:
[ back to the start of the thread... ]

Marc Munro <marc@bloodnok.com> writes:
> We have now experienced index corruption on two separate but identical
> slony clusters.  In each case the slony subscriber failed after
> attempting to insert a duplicate record.  In each case reindexing the
> sl_log_1 table on the provider fixed the problem.

> Sun v40z 4 x Dual Core AMD Opteron(tm) Processor 875
> Kernel 2.6.16.14 #8 SMP x86_64 x86_64 x86_64 GNU/Linux
> ...
> NetApp FAS270 OnTap 7.0.3
> Mounted with the NFS options
> rw,nfsvers=3,hard,rsize=32768,wsize=32768,timeo=600,tcp,noac
> Jumbo frames 8192 MTU.
> All postgres data and logs are stored on the netapp.

BTW, a couple of thoughts here:

* If my theory about the low-level cause is correct, then reindexing
sl_log_1 would make the "duplicate key" errors go away, but nonetheless
you'd have lost data --- the overwritten rows would be gone.  I suppose
that this would result in the slave missing some rows that are present
on the master.  Have you tried comparing slave and master databases to
see if you can find any discrepancies?

* One way that the problem could happen would be if a race condition in
the kernel allowed an lseek(fd, 0, SEEK_END) to return a value less than
the true end-of-file (as determined by another process' write()
extending the EOF just slightly earlier --- ie, lseek fails to note the
effects of the just-completed write, and returns the prior EOF value).
PG does have internal locking that should guarantee that the lseek is
not done until after the write completes ... but could there be a bug in
the kernel allowing stale data to be returned?  The SMP hardware is
relevant (maybe one processor sees different data than the other) and
frankly I don't trust NFS very far at all for questions such as this.
It'd be interesting to see if you can reproduce the problem in a
database on local storage.
        regards, tom lane


Re: Index corruption

From
Marc Munro
Date:
On Thu, 2006-06-29 at 21:47 -0400, Tom Lane wrote:
> One easy thing that would be worth trying is to build with
> --enable-cassert and see if any Asserts get provoked during the
> failure case.  I don't have a lot of hope for that, but it's
> something that would require only machine time not people time.

I'll try this tomorrow.

> A couple other things to try, given that you can provoke the failure
> fairly easily:
>
> 1. In studying the code, it bothers me a bit that P_NEW is the same as
> InvalidBlockNumber.  The intended uses of P_NEW appear to be adequately
> interlocked, but it's fairly easy to see how something like this could
> happen if there are any places where InvalidBlockNumber is
> unintentionally passed to ReadBuffer --- that would look like a P_NEW
> call and it *wouldn't* be interlocked.  So it would be worth changing
> P_NEW to "(-2)" (this should just take a change in bufmgr.h and
> recompile) and adding an "Assert(blockNum != InvalidBlockNumber)"
> at the head of ReadBufferInternal().  Then rebuild with asserts enabled
> and see if the failure case provokes that assert.

I'll try this too.

> 2. I'm also eyeing this bit of code in hio.c:
>
>         /*
>          * If the FSM knows nothing of the rel, try the last page before
>          * we give up and extend.  This avoids one-tuple-per-page syndrome
>          * during bootstrapping or in a recently-started system.
>          */
>         if (targetBlock == InvalidBlockNumber)
>         {
>             BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
>
>             if (nblocks > 0)
>                 targetBlock = nblocks - 1;
>         }
>
> If someone else has just extended the relation, it's possible that this
> will allow a process to get to the page before the intended extender has
> finished initializing it.  AFAICT that's not harmful because the page
> will look like it has no free space ... but it seems a bit fragile.
> If you dike out the above-mentioned code, can you still provoke the
> failure?

By dike out, you mean remove?  Please confirm and I'll try it.

> A different line of attack is to see if you can make a self-contained
> test case so other people can try to reproduce it.  More eyeballs on the
> problem are always better.

Can't really see this being possible.  This is clearly a very unusual
problem and without similar hardware I doubt that anyone else will
trigger it.  We ran this system happily for nearly a year on the
previous kernel without experiencing this problem (tcp lockups are a
different matter).  Also the load is provided by a bunch of servers and
robots simulating rising and falling load.

> Lastly, it might be interesting to look at the WAL logs for the period
> leading up to a failure.  This would give us an idea of what was
> happening concurrently with the processes that seem directly involved.

Next time we reproduce it, I'll take a copy of the WAL files too.

__
Marc

Re: Index corruption

From
Tom Lane
Date:
Marc Munro <marc@bloodnok.com> writes:
> By dike out, you mean remove?  Please confirm and I'll try it.

Right, just remove (or comment out) the lines I quoted.

> We ran this system happily for nearly a year on the
> previous kernel without experiencing this problem (tcp lockups are a
> different matter).

I'm starting to think that a kernel bug is a realistic explanation,
see other message.  Whose kernel build is this exactly?
        regards, tom lane


Re: Index corruption

From
Marc Munro
Date:
On Thu, 2006-06-29 at 21:59 -0400, Tom Lane wrote:
> [ back to the start of the thread... ]
>
> BTW, a couple of thoughts here:
>
> * If my theory about the low-level cause is correct, then reindexing
> sl_log_1 would make the "duplicate key" errors go away, but nonetheless
> you'd have lost data --- the overwritten rows would be gone.  I suppose
> that this would result in the slave missing some rows that are present
> on the master.  Have you tried comparing slave and master databases to
> see if you can find any discrepancies?

Haven't done that yet - in test we tend to restart the old subscriber as
the new provider and rebuild the cluster.  I'll check the logs from our
production failure to figure out what to compare and see what I can
discover.
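
Most likely we will just run something like this on both nodes over the
transaction id range from the failed sync and diff the output (rough
sketch; the range below is only an example based on the earlier error):

    SELECT table_transaction_id, attribute_type, count(*)
      FROM table_trans_attribute
     WHERE table_transaction_id BETWEEN 374500000 AND 374550000
     GROUP BY table_transaction_id, attribute_type
     ORDER BY table_transaction_id, attribute_type;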

> * One way that the problem could happen would be if a race condition in
> the kernel allowed an lseek(fd, 0, SEEK_END) to return a value less than
> the true end-of-file (as determined by another process' write()
> extending the EOF just slightly earlier --- ie, lseek fails to note the
> effects of the just-completed write, and returns the prior EOF value).
> PG does have internal locking that should guarantee that the lseek is
> not done until after the write completes ... but could there be a bug in
> the kernel allowing stale data to be returned?  The SMP hardware is
> relevant (maybe one processor sees different data than the other) and
> frankly I don't trust NFS very far at all for questions such as this.
> It'd be interesting to see if you can reproduce the problem in a
> database on local storage.

Unfortunately we haven't got any local storage that can stand the sort
of loads we are putting through.  With slower storage the CPUs mostly
sit idle and we are very unlikely to trigger a timing-based bug if
that's what it is.

I'll get back to you with kernel build information tomorrow.  We'll also
try to talk to some kernel hackers about this.

Many thanks for your efforts so far.
__
Marc

Re: Index corruption

From
Tom Lane
Date:
Marc Munro <marc@bloodnok.com> writes:
> I'll get back to you with kernel build information tomorrow.  We'll also
> try to talk to some kernel hackers about this.

Some googling turned up recent discussions about race conditions in
Linux NFS code:

http://threebit.net/mail-archive/linux-kernel/msg05313.html
http://lkml.org/lkml/2006/3/1/381

I don't know the kernel nearly well enough to guess if these are related
...
        regards, tom lane


Re: Index corruption

From
"Marko Kreen"
Date:
On 6/30/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I don't know the kernel nearly well enough to guess if these are related
> ...

The sl_log_* tables are indexed on xid, where the relations between
values are not exactly stable.  When having high enough activity on
one node or having nodes with XIDs on different enough positions
funny things happen.

Have you already excluded this as a probable reason?

-- 
marko


Re: Index corruption

From
Tom Lane
Date:
"Marko Kreen" <markokr@gmail.com> writes:
> The sl_log_* tables are indexed on xid, where the relations between
> values are not exactly stable.  When having high enough activity on
> one node or having nodes with XIDs on different enough positions
> funny things happen.

Yeah, that was one of the first things I thought about, but the range of
XIDs involved in these test cases isn't large enough to involve a
wraparound, and anyway it's now looking like the problem is loss of heap
entries, not index corruption at all.

Slony's use of XID comparison semantics for indexes is definitely pretty
scary though.  If I were them I'd find a way to get rid of it.  In
theory, since the table is only supposed to contain "recent" XIDs,
as long as you keep it vacuumed the index should never contain any
inconsistently-comparable XIDs ... but in a big index, the boundary keys
for upper-level index pages might hang around an awful long time ...
        regards, tom lane


Re: Index corruption

From
Tom Lane
Date:
I trawled through the first, larger dump you sent me, and found multiple
index entries pointing to quite a few heap tuples:
Occurrences    block     item

2              43961     1
2              43961     2
2              43961     3
2              43961     4
2              43961     5
2              43961     6
2              43961     7
2              43961     8
2              43961     9
2              119695    1
2              119695    2
2              119695    3
2              126029    1
2              126029    2
2              126029    3
2              166694    1
2              166865    1
2              166865    2
2              166865    3
2              166865    4
2              166865    5
2              166865    6
2              166865    7
2              206221    1
2              247123    1
2              327775    1
2              327775    2
2              327775    3
2              327775    4
2              327775    5
2              327775    6
2              327775    7
2              327775    8
2              327775    9
2              327775    10
2              327775    11

Both indexes show identical sets of duplicates, which makes it pretty
hard to credit that it's a within-index problem.

You mentioned that the test had been allowed to run for a good while
after the first slave error was noted.  So it seems there's no question
that we are looking at some mechanism that allows the first few entries
on a heap page to be lost and overwritten :-(, and that this happened
several times over the course of the larger run.
        regards, tom lane


Re: Index corruption

From
Jan Wieck
Date:
On 6/30/2006 9:55 AM, Tom Lane wrote:
> "Marko Kreen" <markokr@gmail.com> writes:
>> The sl_log_* tables are indexed on xid, where the relations between
>> values are not exactly stable.  When having high enough activity on
>> one node or having nodes with XIDs on different enough positions
>> funny things happen.
> 
> Yeah, that was one of the first things I thought about, but the range of
> XIDs involved in these test cases isn't large enough to involve a
> wraparound, and anyway it's now looking like the problem is loss of heap
> entries, not index corruption at all.
> 
> Slony's use of XID comparison semantics for indexes is definitely pretty
> scary though.  If I were them I'd find a way to get rid of it.  In
> theory, since the table is only supposed to contain "recent" XIDs,
> as long as you keep it vacuumed the index should never contain any
> inconsistently-comparable XIDs ... but in a big index, the boundary keys
> for upper-level index pages might hang around an awful long time ...

With the final implementation of log switching, the problem of xxid 
wraparound will be avoided entirely. Every now and then slony will 
switch from one log table to the other, and when the old one is drained 
and logically empty, it is truncated, which should reinitialize all indexes.


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


Re: Index corruption

From
"Marko Kreen"
Date:
On 6/30/06, Jan Wieck <JanWieck@yahoo.com> wrote:
> With the final implementation of log switching, the problem of xxid
> wraparound will be avoided entirely. Every now and then slony will
> switch from one to another log table and when the old one is drained and
> logically empty, it is truncated, which should reinitialize all indexes.

If the xxid-s come from different DB-s, then there can still be problems.

-- 
marko


Re: Index corruption

From
Jan Wieck
Date:
On 6/30/2006 11:17 AM, Marko Kreen wrote:

> On 6/30/06, Jan Wieck <JanWieck@yahoo.com> wrote:
>> With the final implementation of log switching, the problem of xxid
>> wraparound will be avoided entirely. Every now and then slony will
>> switch from one to another log table and when the old one is drained and
>> logically empty, it is truncated, which should reinitialize all indexes.
> 
> If the xxid-s come from different DB-s, then there can still be problems.

How so? They are always part of a multi-key index having the 
originating node ID first.


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


Re: Index corruption

From
Tom Lane
Date:
Jan Wieck <JanWieck@Yahoo.com> writes:
> On 6/30/2006 11:17 AM, Marko Kreen wrote:
>> If the xxid-s come from different DB-s, then there can still be problems.

> How so? They are allways part of a multi-key index having the 
> originating node ID first.

Really?

create table @NAMESPACE@.sl_log_1 (
    log_origin          int4,
    log_xid             @NAMESPACE@.xxid,
    log_tableid         int4,
    log_actionseq       int8,
    log_cmdtype         char,
    log_cmddata         text
);
create index sl_log_1_idx1 on @NAMESPACE@.sl_log_1
    (log_origin, log_xid @NAMESPACE@.xxid_ops, log_actionseq);

create index sl_log_1_idx2 on @NAMESPACE@.sl_log_1
    (log_xid @NAMESPACE@.xxid_ops);

sl_log_1_idx2 doesn't seem to have any such protection.  When I was
poking at Marc's example, though, it seemed that the numbers going
into the table were all *locally generated* XIDs, in fact the same
as the XID doing the insertions.  If this is only true on the master,
and slaves can be inserting XIDs coming from different masters,
then I think it will break.
        regards, tom lane


Re: Index corruption

From
Brad Nicholson
Date:
Tom Lane wrote:
> Marc Munro <marc@bloodnok.com> writes:
>> I'll get back to you with kernel build information tomorrow.  We'll also
>> try to talk to some kernel hackers about this.
> 
> Some googling turned up recent discussions about race conditions in
> Linux NFS code:
> 
> http://threebit.net/mail-archive/linux-kernel/msg05313.html
> http://lkml.org/lkml/2006/3/1/381
> 
> I don't know the kernel nearly well enough to guess if these are related
> ...

It may or may not be the same issue, but for what it's worth, we've seen the same sl_log_1 corruption on AIX 5.1 and
5.3

-- 
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.



Re: Index corruption

From
Jan Wieck
Date:
On 6/30/2006 11:55 AM, Tom Lane wrote:

> Jan Wieck <JanWieck@Yahoo.com> writes:
>> On 6/30/2006 11:17 AM, Marko Kreen wrote:
>>> If the xxid-s come from different DB-s, then there can still be problems.
> 
>> How so? They are allways part of a multi-key index having the 
>> originating node ID first.
> 
> Really?
> 
> create table @NAMESPACE@.sl_log_1 (
>     log_origin            int4,
>     log_xid                @NAMESPACE@.xxid,
>     log_tableid            int4,
>     log_actionseq        int8,
>     log_cmdtype            char,
>     log_cmddata            text
> );
> create index sl_log_1_idx1 on @NAMESPACE@.sl_log_1
>     (log_origin, log_xid @NAMESPACE@.xxid_ops, log_actionseq);
> 
> create index sl_log_1_idx2 on @NAMESPACE@.sl_log_1
>     (log_xid @NAMESPACE@.xxid_ops);

You're right ... forgot about that one. And yes, there can be 
transactions originating from multiple origins (masters) in the same 
log. The thing is, the index is only there because in a single-origin 
situation (which most installations are), the log_origin is always the 
same, so the optimizer sometimes didn't think using an index at all 
would be good.

However, transactions from different origins are NEVER selected together 
and it wouldn't make sense to compare their xid's anyway. So the index 
might return index tuples for rows from another origin, but the 
following qualifications against the log_origin in the heap tuple will 
filter them out.


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


Re: Index corruption

From
Brad Nicholson
Date:
Brad Nicholson wrote:
> Tom Lane wrote:
>> Marc Munro <marc@bloodnok.com> writes:
>>> I'll get back to you with kernel build information tomorrow.  We'll also
>>> try to talk to some kernel hackers about this.
>> Some googling turned up recent discussions about race conditions in
>> Linux NFS code:
>>
>> http://threebit.net/mail-archive/linux-kernel/msg05313.html
>> http://lkml.org/lkml/2006/3/1/381
>>
>> I don't know the kernel nearly well enough to guess if these are related
>> ...
> 
> It may or may not be the same issue, but for what it's worth, we've seen
>   the same sl_log_1 corruption on AIX 5.1 and 5.3
> 

On 7.4.12, I should add.

-- 
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.



Re: Index corruption

From
Tom Lane
Date:
Brad Nicholson <bnichols@ca.afilias.info> writes:
> It may or may not be the same issue, but for what it's worth, we've seen
>   the same sl_log_1 corruption on AIX 5.1 and 5.3

Hm, on what filesystem, and what PG version(s)?

I'm not completely satisfied by the its-a-kernel-bug theory, because if
it were then ISTM extending an index would be subject to the same risks
as extending a table; but I see no evidence of index page lossage in
Marc's dump.  OTOH the usage patterns are different, so maybe PG isn't
stressing the write-to-lseek path quite as hard for indexes.
        regards, tom lane


Re: Index corruption

From
Brad Nicholson
Date:
Tom Lane wrote:
> Brad Nicholson <bnichols@ca.afilias.info> writes:
>> It may or may not be the same issue, but for what it's worth, we've seen
>>   the same sl_log_1 corruption on AIX 5.1 and 5.3
> 
> Hm, on what filesystem, and what PG version(s)?
> 
> I'm not completely satisfied by the its-a-kernel-bug theory, because if
> it were then ISTM extending an index would be subject to the same risks
> as extending a table; but I see no evidence of index page lossage in
> Marc's dump.  OTOH the usage patterns are different, so maybe PG isn't
> stressing the write-to-lseek path quite as hard for indexes.
> 
>             regards, tom lane

jfs2 in all cases.  I don't recall the PG version for 5.1, but it was
greater than 7.4.8.  For 5.3, it was 7.4.12.

-- 
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.



Re: Index corruption

From
Tom Lane
Date:
Jan Wieck <JanWieck@Yahoo.com> writes:
> You're right ... forgot about that one.

> However, transactions from different origins are NEVER selected together 
> and it wouldn't make sense to compare their xid's anyway. So the index 
> might return index tuples for rows from another origin, but the 
> following qualifications against the log_origin in the heap tuple will 
> filter them out.

No, that's not the point here.  The point here is that if the xids that
are simultaneously present in the index span more than a 2G-XID range,
btree *will fail* because it will be dealing with keys that do not obey
the transitive law.  You do have a problem --- it doesn't explain Marc's
troubles, but sl_log_1_idx2 is broken for multi master situations.  All
you need is masters with sufficiently different XID counters.
        regards, tom lane


Re: Index corruption

From
Hannu Krosing
Date:
On Fri, 2006-06-30 at 12:05, Jan Wieck wrote:
> On 6/30/2006 11:55 AM, Tom Lane wrote:
> 
> > Jan Wieck <JanWieck@Yahoo.com> writes:
> >> On 6/30/2006 11:17 AM, Marko Kreen wrote:
> >>> If the xxid-s come from different DB-s, then there can still be problems.
> > 
> >> How so? They are allways part of a multi-key index having the 
> >> originating node ID first.
> > 
> > Really?
> > 
> > create table @NAMESPACE@.sl_log_1 (
> >     log_origin            int4,
> >     log_xid                @NAMESPACE@.xxid,
> >     log_tableid            int4,
> >     log_actionseq        int8,
> >     log_cmdtype            char,
> >     log_cmddata            text
> > );
> > create index sl_log_1_idx1 on @NAMESPACE@.sl_log_1
> >     (log_origin, log_xid @NAMESPACE@.xxid_ops, log_actionseq);
> > 
> > create index sl_log_1_idx2 on @NAMESPACE@.sl_log_1
> >     (log_xid @NAMESPACE@.xxid_ops);
> 
> You're right ... forgot about that one. And yes, there can be 
> transactions originating from multiple origins (masters) in the same 
> log. The thing is, the index is only there because in a single origin 
> situation (most installations are), the log_origin is allways the same. 
> The optimizer therefore sometimes didn't think using an index at all 
> would be good.
> 
> However, transactions from different origins are NEVER selected together 
> and it wouldn't make sense to compare their xid's anyway. So the index 
> might return index tuples for rows from another origin, but the 
> following qualifications against the log_origin in the heap tuple will 
> filter them out.

The problem was not only with returning too many rows, but just as much
with returning too few.  If too few rows are returned, some actions will
simply be left out of replication and thus will be missing from the slaves.

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com





Re: Index corruption

From
Marc Munro
Date:
On Thu, 2006-06-29 at 21:47 -0400, Tom Lane wrote:
> One easy thing that would be worth trying is to build with
> --enable-cassert and see if any Asserts get provoked during the

> A couple other things to try, given that you can provoke the failure
> fairly easily:
> . . .
> 1. In studying the code, it bothers me a bit that P_NEW is the same as
> InvalidBlockNumber.  The intended uses of P_NEW appear to be adequately
> . . .
> 2. I'm also eyeing this bit of code in hio.c:
>
> If someone else has just extended the relation, it's possible that this
> will allow a process to get to the page before the intended extender has
> . . .

We tried all of these suggestions and still get the problem.  Nothing
interesting in the log file so I guess the Asserts did not fire.

We are going to try experimenting with different kernels now.  Unless
anyone has any other suggestions.

__
Marc

Re: Index corruption

From
Tom Lane
Date:
Marc Munro <marc@bloodnok.com> writes:
> We tried all of these suggestions and still get the problem.  Nothing
> interesting in the log file so I guess the Asserts did not fire.

Not surprising, it was a long shot that any of those things were really
broken.  But worth testing.

> We are going to try experimenting with different kernels now.  Unless
> anyone has any other suggestions.

Right at the moment I have no better ideas :-(
        regards, tom lane


Re: Index corruption

From
Marc Munro
Date:
For the record, here are the results of our (ongoing) investigation into
the index/heap corruption problems I reported a couple of weeks ago.

We were able to trigger the problem with kernels 2.6.16, 2.6.17 and
2.6.18.rc1, with 2.6.16 seeming to be the most flaky.

By replacing the NFS-mounted netapp with a fibre-channel SAN, we have
eliminated the problem on all kernels.

From this, it would seem to be an NFS bug introduced post 2.6.14, though
we cannot rule out a postgres bug exposed by unusual timing issues.

Our starting systems are:

Sun v40z 4 x Dual Core AMD Opteron(tm) Processor 875
Kernel 2.6.16.14 #8 SMP x86_64 x86_64 x86_64 GNU/Linux (and others)
kernel boot option: elevator=deadline
16 Gigs of RAM
postgresql-8.0.8-1PGDG
Bonded e1000/tg3 NICs with 8192 MTU.
Slony 1.1.5

NetApp FAS270 OnTap 7.0.3
Mounted with the NFS options
rw,nfsvers=3,hard,rsize=32768,wsize=32768,timeo=600,tcp,noac
Jumbo frames 8192 MTU.

All postgres data and logs are stored on the netapp.

All tests results were reproduced with postgres 8.0.8

__
Marc

On Fri, 2006-06-30 at 23:20 -0400, Tom Lane wrote:
> Marc Munro <marc@bloodnok.com> writes:
> > We tried all of these suggestions and still get the problem.  Nothing
> > interesting in the log file so I guess the Asserts did not fire.
>
> Not surprising, it was a long shot that any of those things were really
> broken.  But worth testing.
>
> > We are going to try experimenting with different kernels now.  Unless
> > anyone has any other suggestions.
>
> Right at the moment I have no better ideas :-(
>
>             regards, tom lane
>

always denying corruption

From
mdean
Date:
Marc Munro wrote:

>For the record, here are the results of our (ongoing) investigation into
>the index/heap corruption problems I reported a couple of weeks ago.
>
>We were able to trigger the problem with kernels 2.6.16, 2.6.17 and
>2.6.18.rc1, with 2.6.16 seeming to be the most flaky.
>
>By replacing the NFS-mounted netapp with a fibre-channel SAN, we have
>eliminated the problem on all kernels.
>
>From this, it would seem to be an NFS bug introduced post 2.6.14, though
>we cannot rule out a postgres bug exposed by unusual timing issues.
>
>Our starting systems are: 
>
>Sun v40z 4 x Dual Core AMD Opteron(tm) Processor 875
>Kernel 2.6.16.14 #8 SMP x86_64 x86_64 x86_64 GNU/Linux (and others)
>kernel boot option: elevator=deadline
>16 Gigs of RAM
>postgresql-8.0.8-1PGDG
>Bonded e1000/tg3 NICs with 8192 MTU.
>Slony 1.1.5
>
>NetApp FAS270 OnTap 7.0.3
>Mounted with the NFS options
>rw,nfsvers=3,hard,rsize=32768,wsize=32768,timeo=600,tcp,noac
>Jumbo frames 8192 MTU.
>
>All postgres data and logs are stored on the netapp.
>
>All tests results were reproduced with postgres 8.0.8
>
>__
>Marc
>
>On Fri, 2006-06-30 at 23:20 -0400, Tom Lane wrote:
>
>>Marc Munro <marc@bloodnok.com> writes:
>>
>>>We tried all of these suggestions and still get the problem.  Nothing
>>>interesting in the log file so I guess the Asserts did not fire.
>>
>>Not surprising, it was a long shot that any of those things were really
>>broken.  But worth testing.
>>
>>>We are going to try experimenting with different kernels now.  Unless
>>>anyone has any other suggestions.
>>
>>Right at the moment I have no better ideas :-(
>>
>>            regards, tom lane
>
On a good stock day, some levity is justified.  How are hackers like 
politicians?

