Thread: Indexes getting corrupted.

Indexes getting corrupted.

From
"Bruno G. Albuquerque"
Date:
I am having a weird problem here. I have the automated process to
install PostgreSQL (8.0.1) on Windows 2000 machines. Besides installing
the database server itself, my process does the following:

1 - Runs initdb to created the database I will be using.
2 - Runs a SQL script that creates the entire database and populates
some of the tables.
3 - Runs another script that populates other tables (using COPY).

What happens is that, sometimes, after the entire process run, I end up
with a database with virtually all indexes corrupt! Try to use the
indexes for anything results in "index is not a btree" error. This
happens randomly and I can get it fixed sometimes by simply erasin
everything and starting over.

It happened in more thna one machine, if that matters. Also, changing
the database from one partition to another almost all the time solves it
(all machines have 2 partitions).

Any ideas about what can be causing this? Manually doing "reindex index"
gets the indexes sorted. Is there a way to automatically reindex all
indexes in one go? that would at least help me a bit.

Thanks in advance.

-Bruno

**********************************************************************
Informação transmitida destina-se apenas à pessoa a quem foi endereçada e pode conter informação confidencial,
legalmenteprotegida e para conhecimento exclusivo do destinatário. Se o leitor desta advertência não for o seu
destinatário,fica ciente de que sua leitura, divulgação ou cópia é estritamente proibida. Caso a mensagem tenha sido
recebidapor engano, favor comunicar ao remetente e apagar o texto de qualquer computador. 


The information transmitted is intended only for the person or entity to which it is addressed and may contain
confidentialand/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any
actionin reliance upon this information, by person or entity other than the intended recipient is prohibited. If you
receivedthis in error, please contact the sender and delete the material from any computer. 
**********************************************************************

Re: Indexes getting corrupted.

From
Tom Arthurs
Date:
Reindexdb in contrib should help with rebuilding indexes.

Bruno G. Albuquerque wrote:
>
> I am having a weird problem here. I have the automated process to
> install PostgreSQL (8.0.1) on Windows 2000 machines. Besides installing
> the database server itself, my process does the following:
>
> 1 - Runs initdb to created the database I will be using.
> 2 - Runs a SQL script that creates the entire database and populates
> some of the tables.
> 3 - Runs another script that populates other tables (using COPY).
>
> What happens is that, sometimes, after the entire process run, I end up
> with a database with virtually all indexes corrupt! Try to use the
> indexes for anything results in "index is not a btree" error. This
> happens randomly and I can get it fixed sometimes by simply erasin
> everything and starting over.
>
> It happened in more thna one machine, if that matters. Also, changing
> the database from one partition to another almost all the time solves it
> (all machines have 2 partitions).
>
> Any ideas about what can be causing this? Manually doing "reindex index"
> gets the indexes sorted. Is there a way to automatically reindex all
> indexes in one go? that would at least help me a bit.
>
> Thanks in advance.
>
> -Bruno
>
> **********************************************************************
> Informação transmitida destina-se apenas à pessoa a quem foi endereçada
> e pode conter informação confidencial, legalmente protegida e para
> conhecimento exclusivo do destinatário. Se o leitor desta advertência
> não for o seu destinatário, fica ciente de que sua leitura, divulgação
> ou cópia é estritamente proibida. Caso a mensagem tenha sido recebida
> por engano, favor comunicar ao remetente e apagar o texto de qualquer
> computador.
>
>
> The information transmitted is intended only for the person or entity to
> which it is addressed and may contain confidential and/or privileged
> material. Any review, retransmission, dissemination or other use of, or
> taking of any action in reliance upon this information, by person or
> entity other than the intended recipient is prohibited. If you received
> this in error, please contact the sender and delete the material from
> any computer.
> **********************************************************************
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
>      subscribe-nomail command to majordomo@postgresql.org so that your
>      message can get through to the mailing list cleanly
>
>
>

Re: Indexes getting corrupted.

From
Scott Marlowe
Date:
On Wed, 2005-06-08 at 13:39, Bruno G. Albuquerque wrote:
> I am having a weird problem here. I have the automated process to
> install PostgreSQL (8.0.1) on Windows 2000 machines. Besides installing
> the database server itself, my process does the following:
>
> 1 - Runs initdb to created the database I will be using.
> 2 - Runs a SQL script that creates the entire database and populates
> some of the tables.
> 3 - Runs another script that populates other tables (using COPY).
>
> What happens is that, sometimes, after the entire process run, I end up
> with a database with virtually all indexes corrupt! Try to use the
> indexes for anything results in "index is not a btree" error. This
> happens randomly and I can get it fixed sometimes by simply erasin
> everything and starting over.

Note that such behaviour is normally associated with bad hardware (RAM,
hard drive, controller, CPU, you name it...)  But since the windows port
is pretty new, you may have found some corner case.

Do you have a "pre-packaged" test case that induces failures most of the
time or something like that?

Are these machines all basically make/model/brand/chipset etc... of
machines?  Then it might be a buggy driver or something.

Re: Indexes getting corrupted.

From
"Bruno G. Albuquerque"
Date:


Scott Marlowe wrote:

>>I am having a weird problem here. I have the automated process to
>>install PostgreSQL (8.0.1) on Windows 2000 machines. Besides installing
>>the database server itself, my process does the following:
>>
>>1 - Runs initdb to created the database I will be using.
>>2 - Runs a SQL script that creates the entire database and populates
>>some of the tables.
>>3 - Runs another script that populates other tables (using COPY).
>>
>>What happens is that, sometimes, after the entire process run, I end up
>>with a database with virtually all indexes corrupt! Try to use the
>>indexes for anything results in "index is not a btree" error. This
>>happens randomly and I can get it fixed sometimes by simply erasin
>>everything and starting over.
>
>
> Note that such behaviour is normally associated with bad hardware (RAM,
> hard drive, controller, CPU, you name it...)  But since the windows port
> is pretty new, you may have found some corner case.

I think I figured it out. The problem is related to rebooting the
machine. Here are the results of my testing:

- Stopping the service before rebooting (log file contents)

[...]
2005-06-10 11:44:40 LOG:  received fast shutdown request
2005-06-10 11:44:40 LOG:  shutting down
2005-06-10 11:44:40 LOG:  database system is shut down
2005-06-10 11:44:41 LOG:  logger shutting down

- After rebooting:

2005-06-10 11:47:25 LOG:  database system was shut down at 2005-06-10
11:44:40 Horário padrão CAIXA -3
2005-06-10 11:47:25 LOG:  checkpoint record is at 0/16D35A8
2005-06-10 11:47:25 LOG:  redo record is at 0/16D35A8; undo record is at
0/0; shutdown TRUE
2005-06-10 11:47:25 LOG:  next transaction ID: 1487; next OID: 64574
2005-06-10 11:47:26 LOG:  database system is ready

- Rebooting the machine without stopping the service:

[...]
2005-06-10 11:39:30 LOG:  received fast shutdown request

- After rebooting:

2005-06-10 11:41:46 LOG:  database system was interrupted at 2005-06-10
11:37:22 Horário padrão CAIXA -3
2005-06-10 11:41:46 LOG:  checkpoint record is at 0/AC30F0
2005-06-10 11:41:46 LOG:  redo record is at 0/AC30F0; undo record is at
0/0; shutdown TRUE
2005-06-10 11:41:46 LOG:  next transaction ID: 546; next OID: 17230
2005-06-10 11:41:46 LOG:  database system was not properly shut down;
automatic recovery in progress
2005-06-10 11:41:47 LOG:  redo starts at 0/AC3130
2005-06-10 11:42:09 LOG:  record with zero length at 0/16C8698
2005-06-10 11:42:09 LOG:  redo done at 0/16C8668
2005-06-10 11:42:15 LOG:  database system is ready
2005-06-10 11:42:38 ERROR:  index "pk_cubtb021_localidade" is not a btree
2005-06-10 11:43:08 ERROR:  index "pk_cubtb021_localidade" is not a btree
2005-06-10 11:43:09 ERROR:  index "pk_cubtb038_ocupacao" is not a btree

The btree errors are the ones I am getting when running the client
application that connets to this database.

So, it seems that 2 things are happening:

1 - For some reason, the system seems to be killing the postgresql
processes before it has a chance to clean up.

2 - When it tries to recover from this, it ends up messing up the
indexes for some reason.

Any pointers? Is there anything I can do for Windows to actually wait
for the postgresql processes to clean up?

> Do you have a "pre-packaged" test case that induces failures most of the
> time or something like that?

As pointed above, it seems the problem is related to shuting down the
machine.

> Are these machines all basically make/model/brand/chipset etc... of
> machines?  Then it might be a buggy driver or something.

Nope. Completelly different. But all running Windows 2000 or XP and
running PostgreSQL 8.0.1.

-Bruno

**********************************************************************
Informação transmitida destina-se apenas à pessoa a quem foi endereçada e pode conter informação confidencial,
legalmenteprotegida e para conhecimento exclusivo do destinatário. Se o leitor desta advertência não for o seu
destinatário,fica ciente de que sua leitura, divulgação ou cópia é estritamente proibida. Caso a mensagem tenha sido
recebidapor engano, favor comunicar ao remetente e apagar o texto de qualquer computador. 


The information transmitted is intended only for the person or entity to which it is addressed and may contain
confidentialand/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any
actionin reliance upon this information, by person or entity other than the intended recipient is prohibited. If you
receivedthis in error, please contact the sender and delete the material from any computer. 
**********************************************************************

Re: Indexes getting corrupted.

From
Tom Lane
Date:
"Bruno G. Albuquerque" <balbuquerque@dba.com.br> writes:
> Any pointers? Is there anything I can do for Windows to actually wait
> for the postgresql processes to clean up?

What it looks like to me is that your disk drive is lying about what
it's actually written; or possibly Windows is lying to Postgres about
what it's actually written.  The former is pretty common and can usually
be fixed by turning off write caching.

            regards, tom lane

Re: Indexes getting corrupted.

From
Simon Riggs
Date:
On Fri, 2005-06-10 at 12:57 -0400, Tom Lane wrote:
> "Bruno G. Albuquerque" <balbuquerque@dba.com.br> writes:
> > Any pointers? Is there anything I can do for Windows to actually wait
> > for the postgresql processes to clean up?
>
> What it looks like to me is that your disk drive is lying about what
> it's actually written; or possibly Windows is lying to Postgres about
> what it's actually written.  The former is pretty common and can usually
> be fixed by turning off write caching.

Possibly. But why would this effect only indexes?

This is the same error that was also reported on GENERAL recently on
8.0.1.

I'm suspicious of a more subtle intermittent error. We have no
information about what the magic values are, only that they are not
correct. Should we increase the information returned for that error?
That might show up an error cause.

Best Regards, Simon Riggs

Attachment

Re: Indexes getting corrupted.

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> I'm suspicious of a more subtle intermittent error.

Yeah, I am too, but so far none of the reporters have been cooperative
about providing more information :-(

> We have no
> information about what the magic values are, only that they are not
> correct. Should we increase the information returned for that error?

I think the proposed patch is a waste of time.  What I am hoping to get
from people is a dump of the whole first page of the corrupted index
(via pg_filedump, or even good ol' od).  That might give us some idea of
what we are dealing with --- localized corruption in a basically good
metapage, or wholesale replacement of the page with some other page (and
if so what), or maybe it is a hardware fault after all.  You can't draw
those sorts of conclusions from one or two words, but with a whole page
to look at you have a shot at telling the difference.

            regards, tom lane

Re: Indexes getting corrupted.

From
"Bruno G. Albuquerque"
Date:
Tom Lane wrote:
>>I'm suspicious of a more subtle intermittent error.
>
> Yeah, I am too, but so far none of the reporters have been cooperative
> about providing more information :-(

Well, at least from me, I can tell you I gave all information I came up
with. One thing that I pointed out is this (Windows 2000):

When you shutdown the machine just after you run my customized
installation procedure and you check the log after rebooting, you will
notice that only the first message (saying that a fast shutdown request
was received) will be in the log. The other messages that mention the
shutdown process completed succesfully do not appear. If I check the
database, the indexes *WILL* be corrupt.

I solved it by automatically stopping the server after the installation
precedure concludes. This solves this problem and, even after rebooting,
the indexes seems to be ok.

> I think the proposed patch is a waste of time.  What I am hoping to get
> from people is a dump of the whole first page of the corrupted index
> (via pg_filedump, or even good ol' od).  That might give us some idea of
> what we are dealing with --- localized corruption in a basically good
> metapage, or wholesale replacement of the page with some other page (and
> if so what), or maybe it is a hardware fault after all.  You can't draw
> those sorts of conclusions from one or two words, but with a whole page
> to look at you have a shot at telling the difference.

I can try to revert my changes to get the problem to happen again. Just
tell me what exactly you need me to do. Also, this problem happened on
*SEVERAL* machines with different configurations. Actually, it happened
with all machines I tried and the "fix" I implemented (as mentioned
above) worked on all of them too.

-Bruno

**********************************************************************
Informação transmitida destina-se apenas à pessoa a quem foi endereçada e pode conter informação confidencial,
legalmenteprotegida e para conhecimento exclusivo do destinatário. Se o leitor desta advertência não for o seu
destinatário,fica ciente de que sua leitura, divulgação ou cópia é estritamente proibida. Caso a mensagem tenha sido
recebidapor engano, favor comunicar ao remetente e apagar o texto de qualquer computador. 


The information transmitted is intended only for the person or entity to which it is addressed and may contain
confidentialand/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any
actionin reliance upon this information, by person or entity other than the intended recipient is prohibited. If you
receivedthis in error, please contact the sender and delete the material from any computer. 
**********************************************************************

Re: Indexes getting corrupted.

From
Simon Riggs
Date:
On Thu, 2005-06-16 at 08:25 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > I'm suspicious of a more subtle intermittent error.
>
> Yeah, I am too, but so far none of the reporters have been cooperative
> about providing more information :-(
>
> > We have no
> > information about what the magic values are, only that they are not
> > correct. Should we increase the information returned for that error?
>
> I think the proposed patch is a waste of time.  What I am hoping to get
> from people is a dump of the whole first page of the corrupted index
> (via pg_filedump, or even good ol' od).  That might give us some idea of
> what we are dealing with --- localized corruption in a basically good
> metapage, or wholesale replacement of the page with some other page (and
> if so what), or maybe it is a hardware fault after all.  You can't draw
> those sorts of conclusions from one or two words, but with a whole page
> to look at you have a shot at telling the difference.

Sorry, let me explain. I immediately assumed that all problem reporters
had not provided further information because they could not, not because
they would not.

The patch is trivial and your suggested methods of looking into this are
much preferable. However, that relies on somebody being able to locate
the metapage, even assuming they don't overwrite the whole thing and
continue. The bug is intermittent and so re-running will allow work to
continue, so it would seem that is exactly what they do.

My patch would allow some more information to be retrieved in the
meantime, while we hope for somebody to upload a damaged metapage. If
the patch is not sufficient, then I'd suggest that the metapage be
dumped to log so it can be more easily provided for inspection.

In general, such matters are usually hardware faults. But all reporters
of this error in 8.0 have only reported index errors and no others,
which leads me to suspect a software related cause.

Best Regards, Simon Riggs