Thread: Indexes getting corrupted.
I am having a weird problem here. I have the automated process to install PostgreSQL (8.0.1) on Windows 2000 machines. Besides installing the database server itself, my process does the following: 1 - Runs initdb to created the database I will be using. 2 - Runs a SQL script that creates the entire database and populates some of the tables. 3 - Runs another script that populates other tables (using COPY). What happens is that, sometimes, after the entire process run, I end up with a database with virtually all indexes corrupt! Try to use the indexes for anything results in "index is not a btree" error. This happens randomly and I can get it fixed sometimes by simply erasin everything and starting over. It happened in more thna one machine, if that matters. Also, changing the database from one partition to another almost all the time solves it (all machines have 2 partitions). Any ideas about what can be causing this? Manually doing "reindex index" gets the indexes sorted. Is there a way to automatically reindex all indexes in one go? that would at least help me a bit. Thanks in advance. -Bruno ********************************************************************** Informação transmitida destina-se apenas à pessoa a quem foi endereçada e pode conter informação confidencial, legalmenteprotegida e para conhecimento exclusivo do destinatário. Se o leitor desta advertência não for o seu destinatário,fica ciente de que sua leitura, divulgação ou cópia é estritamente proibida. Caso a mensagem tenha sido recebidapor engano, favor comunicar ao remetente e apagar o texto de qualquer computador. The information transmitted is intended only for the person or entity to which it is addressed and may contain confidentialand/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any actionin reliance upon this information, by person or entity other than the intended recipient is prohibited. If you receivedthis in error, please contact the sender and delete the material from any computer. **********************************************************************
Reindexdb in contrib should help with rebuilding indexes. Bruno G. Albuquerque wrote: > > I am having a weird problem here. I have the automated process to > install PostgreSQL (8.0.1) on Windows 2000 machines. Besides installing > the database server itself, my process does the following: > > 1 - Runs initdb to created the database I will be using. > 2 - Runs a SQL script that creates the entire database and populates > some of the tables. > 3 - Runs another script that populates other tables (using COPY). > > What happens is that, sometimes, after the entire process run, I end up > with a database with virtually all indexes corrupt! Try to use the > indexes for anything results in "index is not a btree" error. This > happens randomly and I can get it fixed sometimes by simply erasin > everything and starting over. > > It happened in more thna one machine, if that matters. Also, changing > the database from one partition to another almost all the time solves it > (all machines have 2 partitions). > > Any ideas about what can be causing this? Manually doing "reindex index" > gets the indexes sorted. Is there a way to automatically reindex all > indexes in one go? that would at least help me a bit. > > Thanks in advance. > > -Bruno > > ********************************************************************** > Informação transmitida destina-se apenas à pessoa a quem foi endereçada > e pode conter informação confidencial, legalmente protegida e para > conhecimento exclusivo do destinatário. Se o leitor desta advertência > não for o seu destinatário, fica ciente de que sua leitura, divulgação > ou cópia é estritamente proibida. Caso a mensagem tenha sido recebida > por engano, favor comunicar ao remetente e apagar o texto de qualquer > computador. > > > The information transmitted is intended only for the person or entity to > which it is addressed and may contain confidential and/or privileged > material. Any review, retransmission, dissemination or other use of, or > taking of any action in reliance upon this information, by person or > entity other than the intended recipient is prohibited. If you received > this in error, please contact the sender and delete the material from > any computer. > ********************************************************************** > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly > > >
On Wed, 2005-06-08 at 13:39, Bruno G. Albuquerque wrote: > I am having a weird problem here. I have the automated process to > install PostgreSQL (8.0.1) on Windows 2000 machines. Besides installing > the database server itself, my process does the following: > > 1 - Runs initdb to created the database I will be using. > 2 - Runs a SQL script that creates the entire database and populates > some of the tables. > 3 - Runs another script that populates other tables (using COPY). > > What happens is that, sometimes, after the entire process run, I end up > with a database with virtually all indexes corrupt! Try to use the > indexes for anything results in "index is not a btree" error. This > happens randomly and I can get it fixed sometimes by simply erasin > everything and starting over. Note that such behaviour is normally associated with bad hardware (RAM, hard drive, controller, CPU, you name it...) But since the windows port is pretty new, you may have found some corner case. Do you have a "pre-packaged" test case that induces failures most of the time or something like that? Are these machines all basically make/model/brand/chipset etc... of machines? Then it might be a buggy driver or something.
Scott Marlowe wrote: >>I am having a weird problem here. I have the automated process to >>install PostgreSQL (8.0.1) on Windows 2000 machines. Besides installing >>the database server itself, my process does the following: >> >>1 - Runs initdb to created the database I will be using. >>2 - Runs a SQL script that creates the entire database and populates >>some of the tables. >>3 - Runs another script that populates other tables (using COPY). >> >>What happens is that, sometimes, after the entire process run, I end up >>with a database with virtually all indexes corrupt! Try to use the >>indexes for anything results in "index is not a btree" error. This >>happens randomly and I can get it fixed sometimes by simply erasin >>everything and starting over. > > > Note that such behaviour is normally associated with bad hardware (RAM, > hard drive, controller, CPU, you name it...) But since the windows port > is pretty new, you may have found some corner case. I think I figured it out. The problem is related to rebooting the machine. Here are the results of my testing: - Stopping the service before rebooting (log file contents) [...] 2005-06-10 11:44:40 LOG: received fast shutdown request 2005-06-10 11:44:40 LOG: shutting down 2005-06-10 11:44:40 LOG: database system is shut down 2005-06-10 11:44:41 LOG: logger shutting down - After rebooting: 2005-06-10 11:47:25 LOG: database system was shut down at 2005-06-10 11:44:40 Horário padrão CAIXA -3 2005-06-10 11:47:25 LOG: checkpoint record is at 0/16D35A8 2005-06-10 11:47:25 LOG: redo record is at 0/16D35A8; undo record is at 0/0; shutdown TRUE 2005-06-10 11:47:25 LOG: next transaction ID: 1487; next OID: 64574 2005-06-10 11:47:26 LOG: database system is ready - Rebooting the machine without stopping the service: [...] 2005-06-10 11:39:30 LOG: received fast shutdown request - After rebooting: 2005-06-10 11:41:46 LOG: database system was interrupted at 2005-06-10 11:37:22 Horário padrão CAIXA -3 2005-06-10 11:41:46 LOG: checkpoint record is at 0/AC30F0 2005-06-10 11:41:46 LOG: redo record is at 0/AC30F0; undo record is at 0/0; shutdown TRUE 2005-06-10 11:41:46 LOG: next transaction ID: 546; next OID: 17230 2005-06-10 11:41:46 LOG: database system was not properly shut down; automatic recovery in progress 2005-06-10 11:41:47 LOG: redo starts at 0/AC3130 2005-06-10 11:42:09 LOG: record with zero length at 0/16C8698 2005-06-10 11:42:09 LOG: redo done at 0/16C8668 2005-06-10 11:42:15 LOG: database system is ready 2005-06-10 11:42:38 ERROR: index "pk_cubtb021_localidade" is not a btree 2005-06-10 11:43:08 ERROR: index "pk_cubtb021_localidade" is not a btree 2005-06-10 11:43:09 ERROR: index "pk_cubtb038_ocupacao" is not a btree The btree errors are the ones I am getting when running the client application that connets to this database. So, it seems that 2 things are happening: 1 - For some reason, the system seems to be killing the postgresql processes before it has a chance to clean up. 2 - When it tries to recover from this, it ends up messing up the indexes for some reason. Any pointers? Is there anything I can do for Windows to actually wait for the postgresql processes to clean up? > Do you have a "pre-packaged" test case that induces failures most of the > time or something like that? As pointed above, it seems the problem is related to shuting down the machine. > Are these machines all basically make/model/brand/chipset etc... of > machines? Then it might be a buggy driver or something. Nope. Completelly different. But all running Windows 2000 or XP and running PostgreSQL 8.0.1. -Bruno ********************************************************************** Informação transmitida destina-se apenas à pessoa a quem foi endereçada e pode conter informação confidencial, legalmenteprotegida e para conhecimento exclusivo do destinatário. Se o leitor desta advertência não for o seu destinatário,fica ciente de que sua leitura, divulgação ou cópia é estritamente proibida. Caso a mensagem tenha sido recebidapor engano, favor comunicar ao remetente e apagar o texto de qualquer computador. The information transmitted is intended only for the person or entity to which it is addressed and may contain confidentialand/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any actionin reliance upon this information, by person or entity other than the intended recipient is prohibited. If you receivedthis in error, please contact the sender and delete the material from any computer. **********************************************************************
"Bruno G. Albuquerque" <balbuquerque@dba.com.br> writes: > Any pointers? Is there anything I can do for Windows to actually wait > for the postgresql processes to clean up? What it looks like to me is that your disk drive is lying about what it's actually written; or possibly Windows is lying to Postgres about what it's actually written. The former is pretty common and can usually be fixed by turning off write caching. regards, tom lane
On Fri, 2005-06-10 at 12:57 -0400, Tom Lane wrote: > "Bruno G. Albuquerque" <balbuquerque@dba.com.br> writes: > > Any pointers? Is there anything I can do for Windows to actually wait > > for the postgresql processes to clean up? > > What it looks like to me is that your disk drive is lying about what > it's actually written; or possibly Windows is lying to Postgres about > what it's actually written. The former is pretty common and can usually > be fixed by turning off write caching. Possibly. But why would this effect only indexes? This is the same error that was also reported on GENERAL recently on 8.0.1. I'm suspicious of a more subtle intermittent error. We have no information about what the magic values are, only that they are not correct. Should we increase the information returned for that error? That might show up an error cause. Best Regards, Simon Riggs
Attachment
Simon Riggs <simon@2ndquadrant.com> writes: > I'm suspicious of a more subtle intermittent error. Yeah, I am too, but so far none of the reporters have been cooperative about providing more information :-( > We have no > information about what the magic values are, only that they are not > correct. Should we increase the information returned for that error? I think the proposed patch is a waste of time. What I am hoping to get from people is a dump of the whole first page of the corrupted index (via pg_filedump, or even good ol' od). That might give us some idea of what we are dealing with --- localized corruption in a basically good metapage, or wholesale replacement of the page with some other page (and if so what), or maybe it is a hardware fault after all. You can't draw those sorts of conclusions from one or two words, but with a whole page to look at you have a shot at telling the difference. regards, tom lane
Tom Lane wrote: >>I'm suspicious of a more subtle intermittent error. > > Yeah, I am too, but so far none of the reporters have been cooperative > about providing more information :-( Well, at least from me, I can tell you I gave all information I came up with. One thing that I pointed out is this (Windows 2000): When you shutdown the machine just after you run my customized installation procedure and you check the log after rebooting, you will notice that only the first message (saying that a fast shutdown request was received) will be in the log. The other messages that mention the shutdown process completed succesfully do not appear. If I check the database, the indexes *WILL* be corrupt. I solved it by automatically stopping the server after the installation precedure concludes. This solves this problem and, even after rebooting, the indexes seems to be ok. > I think the proposed patch is a waste of time. What I am hoping to get > from people is a dump of the whole first page of the corrupted index > (via pg_filedump, or even good ol' od). That might give us some idea of > what we are dealing with --- localized corruption in a basically good > metapage, or wholesale replacement of the page with some other page (and > if so what), or maybe it is a hardware fault after all. You can't draw > those sorts of conclusions from one or two words, but with a whole page > to look at you have a shot at telling the difference. I can try to revert my changes to get the problem to happen again. Just tell me what exactly you need me to do. Also, this problem happened on *SEVERAL* machines with different configurations. Actually, it happened with all machines I tried and the "fix" I implemented (as mentioned above) worked on all of them too. -Bruno ********************************************************************** Informação transmitida destina-se apenas à pessoa a quem foi endereçada e pode conter informação confidencial, legalmenteprotegida e para conhecimento exclusivo do destinatário. Se o leitor desta advertência não for o seu destinatário,fica ciente de que sua leitura, divulgação ou cópia é estritamente proibida. Caso a mensagem tenha sido recebidapor engano, favor comunicar ao remetente e apagar o texto de qualquer computador. The information transmitted is intended only for the person or entity to which it is addressed and may contain confidentialand/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any actionin reliance upon this information, by person or entity other than the intended recipient is prohibited. If you receivedthis in error, please contact the sender and delete the material from any computer. **********************************************************************
On Thu, 2005-06-16 at 08:25 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > I'm suspicious of a more subtle intermittent error. > > Yeah, I am too, but so far none of the reporters have been cooperative > about providing more information :-( > > > We have no > > information about what the magic values are, only that they are not > > correct. Should we increase the information returned for that error? > > I think the proposed patch is a waste of time. What I am hoping to get > from people is a dump of the whole first page of the corrupted index > (via pg_filedump, or even good ol' od). That might give us some idea of > what we are dealing with --- localized corruption in a basically good > metapage, or wholesale replacement of the page with some other page (and > if so what), or maybe it is a hardware fault after all. You can't draw > those sorts of conclusions from one or two words, but with a whole page > to look at you have a shot at telling the difference. Sorry, let me explain. I immediately assumed that all problem reporters had not provided further information because they could not, not because they would not. The patch is trivial and your suggested methods of looking into this are much preferable. However, that relies on somebody being able to locate the metapage, even assuming they don't overwrite the whole thing and continue. The bug is intermittent and so re-running will allow work to continue, so it would seem that is exactly what they do. My patch would allow some more information to be retrieved in the meantime, while we hope for somebody to upload a damaged metapage. If the patch is not sufficient, then I'd suggest that the metapage be dumped to log so it can be more easily provided for inspection. In general, such matters are usually hardware faults. But all reporters of this error in 8.0 have only reported index errors and no others, which leads me to suspect a software related cause. Best Regards, Simon Riggs