Thread: Corruption of files in PostgreSQL
Hi everyone,

my name is Paolo Bizzarri and I am a developer of PAFlow, a document tracking and management system for public administrations.

We use postgres as a backend, and we are experiencing some corruption problems on openoffice files.

As our application is rather complex (it includes Zope as an application server, OpenOffice as a document server and as a client) we need some info on how to check that we are interacting correctly with Postgres. Do you have any hints on what is useful to check/see?

We are currently using:
- PostgreSQL 7.4.8;
- psycopg 1.1.11;
- Zope 2.7.x;
- Openoffice 2.2.

Best regards.
Paolo Bizzarri
Paolo Bizzarri wrote:
> We use postgres as a backend, and we are experimenting some corruption
> problems on openoffice files.

1. How are you storing these files?
2. What is the nature of the corruption?

> As our application is rather complex (it includes Zope as an
> application server, OpenOffice as a document server and as a client)
> we need some info on how to check that we are interacting correctly
> with Postgres.

Shouldn't matter.

> We are currently using:
>
> - PostgreSQL 7.4.8;

Well, you need to upgrade this - version 7.4.17 is the latest in the 7.4 series. You are missing 9 separate batches of bug and security fixes.

> - pyscopg 1.1.11;
> - Zope 2.7.x;
> - Openoffice 2.2.

None of this should matter really, unless there's some subtle bug in psycopg causing corruption of data in-transit.

Let's get some details on the two questions above and see if there's a pattern to your problems.

--
Richard Huxton
Archonet Ltd
On 5/30/07, Richard Huxton <dev@archonet.com> wrote:
> Paolo Bizzarri wrote:
> > We use postgres as a backend, and we are experimenting some corruption
> > problems on openoffice files.
>
> 1. How are you storing these files?

Files are stored as large objects. They are written with an lo_write and their contents are passed as a Binary object.

> 2. What is the nature of the corruption?

Apparently, files get truncated.

> > As our application is rather complex (it includes Zope as an
> > application server, OpenOffice as a document server and as a client)
> > we need some info on how to check that we are interacting correctly
> > with Postgres.
>
> Shouldn't matter.

I hope so...

> > We are currently using:
> >
> > - PostgreSQL 7.4.8;
>
> Well, you need to upgrade this - version 7.4.17 is the latest in the 7.4
> series. You are missing 9 separate batches of bug and security fixes.

Ok. We will upgrade and see if this can help solve the problem.

> > - pyscopg 1.1.11;
> > - Zope 2.7.x;
> > - Openoffice 2.2.
>
> None of this should matter really, unless there's some subtle bug in
> psycopg causing corruption of data in-transit.
>
> Let's get some details on the two questions above and see if there's a
> pattern to your problems.

Ok. Thank you.

Paolo Bizzarri
Icube S.r.l.
Paolo Bizzarri wrote:
> On 5/30/07, Richard Huxton <dev@archonet.com> wrote:
>> Paolo Bizzarri wrote:
>> > We use postgres as a backend, and we are experimenting some corruption
>> > problems on openoffice files.
>>
>> 1. How are you storing these files?
>
> Files are stored as large objects. They are written with an lo_write
> and its contents is passed as a Binary object.
>
>> 2. What is the nature of the corruption?
>
> Apparently, files get truncated.

Interesting. You might want to read this recent thread (ongoing):
http://archives.postgresql.org/pgsql-general/2007-05/msg00734.php

--
Richard Huxton
Archonet Ltd
Paolo Bizzarri,
I am also using PostgreSQL in my application and am also facing a file object corruption problem.
I have already discussed it several times with Richard Huxton, and we ended without any clue.
Here is a brief description of my problem; see if you find any clue in it.
I am storing/retrieving my files in PostgreSQL using the lo_export() and lo_import() API.
After a few weeks (as the application is used, the number of file objects in the database also grows) my file objects get corrupted, and I have no clue about what causes this.
I confirmed the file corruption with the following query:
sfrs2=> select loid, pageno, length(data) from pg_largeobject where loid = 101177 and pageno = 630;
  loid  | pageno | length
--------+--------+--------
 101177 |    630 |    181
(1 row)
But the result of the same query before corruption (i.e., immediately after the file object was added to the table) is:
fasp_test=> select loid, pageno, length(data) from pg_largeobject where loid = 106310 and pageno = 630;
  loid  | pageno | length
--------+--------+--------
 106310 |    630 |    205
(1 row)
I uploaded the same file to both databases (sfrs2 and fasp_test). The first result is after the corruption, and the latter is before the corruption.
You can also confirm your problem like this. I strongly believe that there is some bug in PostgreSQL.
Kindly don't forget to alert me once you find the solution/cause.
Regards,
Purusothaman A
--
http://PurusothamanA.wordpress.com/
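
The page-length check in the message above can be turned around: given the size of the original file, one can estimate which page should be the last one and how long it should be. The following is only a sketch under two assumptions that the thread does not confirm: that the server uses the default large object page size of 2048 bytes (LOBLKSIZE, a quarter of the default 8192-byte block size), and that the object was written sequentially so every page except the last is full. The helper name `expected_last_page` is made up for this example.

```python
LOBLKSIZE = 2048  # assumption: default build, BLCKSZ (8192) / 4


def expected_last_page(size_in_bytes, page_size=LOBLKSIZE):
    """Return (last_pageno, last_page_length) for a large object of this size.

    Page numbers start at 0. Returns None for an empty object, which has
    no pages at all.
    """
    if size_in_bytes <= 0:
        return None
    full_pages, remainder = divmod(size_in_bytes, page_size)
    if remainder == 0:
        # The size is an exact multiple of the page size: the last page is full.
        return full_pages - 1, page_size
    return full_pages, remainder
```

For example, if page 630 were the last page of the file (the thread does not say so), a length of 205 would correspond to a file of 630 * 2048 + 205 = 1290445 bytes, and a length of 181 on that page would mean 24 bytes are missing from the end.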
Hi everyone,

a little update.

We have upgraded our system to 7.4.17. The problem of truncated files now seems better, but it is still present. We have not found a clearly understandable pattern for why this happens.

Just to provide some further information:
- we create a file and store it in the DB;
- we give the file to the user, who can modify it at will;
- we store the modified file back in the DB;
- the last two points can happen several times.

Any hint?

Best regards.
Paolo Bizzarri
Icube S.r.l.
"Paolo Bizzarri" <pibizza@gmail.com> writes: > Any hint? Please provide a reproducible test case ... regards, tom lane
Hi Tom,

as explained above, the problem seems quite random. So I need to understand what we have to check.

Best regards.
Paolo Bizzarri
Icube S.r.l.

On 6/2/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Paolo Bizzarri" <pibizza@gmail.com> writes:
> > Any hint?
>
> Please provide a reproducible test case ...
>
> regards, tom lane
Hi Tom Lane,
In my case, we upload/download files to/from PostgreSQL.
We don't change the content of a file once it has been loaded into PostgreSQL.
As the days go by, more files are stored in PostgreSQL, and their content never changes after that.
But we download the stored files many times, as needed.
My guess from my situation is that PostgreSQL is crossing the boundaries of the file objects while accessing them, because we simply use two API calls, lo_export() and lo_import(), for storing and retrieving files, and never attempt to alter their contents.
And I have never been able to find any pattern in how the file objects get corrupted.
Regards,
Purusothaman A
"Paolo Bizzarri" <pibizza@gmail.com> writes:
> On 6/2/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Please provide a reproducible test case ...

> as explained above, the problem seems quite random. So I need to
> understand what we have to check.

In this context "reproducible" means that the failure happens eventually. I don't care if the test program only fails once in thousands of tries --- I just want a complete self-contained example that produces a failure. I don't have the time to try to reverse-engineer a test case from your rather vague description, whereas I suppose you can make one by stripping down code you've already got.

The sub-text here is that I don't really believe that lo_import and lo_export in themselves are broken. There must be some extra factor --- something else you are doing, or something in your environment --- contributing to the bug. Thus, the odds of someone else building a usable test case from scratch aren't that good, and being able to reproduce the failure outside your environment is an essential step.

regards, tom lane
On 6/2/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Paolo Bizzarri" <pibizza@gmail.com> writes:
> > On 6/2/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Please provide a reproducible test case ...
>
> > as explained above, the problem seems quite random. So I need to
> > understand what we have to check.
>
> In this context "reproducible" means that the failure happens
> eventually. I don't care if the test program only fails once in
> thousands of tries --- I just want a complete self-contained example
> that produces a failure.

As said above, our application is rather complex and involves several different pieces of software, including Zope, OpenOffice both as server and client, and PostgreSQL. We are absolutely NOT sure that the problem is inside PostgreSQL.

What we are trying to understand is, first and foremost, whether there are known cases in which PostgreSQL can truncate a file.

> I don't have the time to try to
> reverse-engineer a test case from your rather vague description, whereas
> I suppose you can make one by stripping down code you've already got.

I was not asking for a test case to be reverse-engineered. I will try to provide an example, but the problem is that, without knowing what to look at, I could omit fundamental details.

> The sub-text here is that I don't really believe that lo_import and
> lo_export in themselves are broken. There must be some extra factor ---
> something else you are doing, or something in your environment ---
> contributing to the bug.

I certainly agree with you. I was asking what to look at and what to check.

> Thus, the odds of someone else building a
> usable test case from scratch aren't that good, and being able to
> reproduce the failure outside your environment is an essential step.

I agree with you. I was not hoping for this. At the same time, I was asking for help on what to look at, so that I can produce a test case.

As an alternative, I can suggest downloading and installing PAFlow, but I understand it is a rather large application....

Best regards.
Paolo Bizzarri
Icube S.r.l.
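
One possible starting point for the self-contained test case requested in this exchange is a small harness that repeatedly stores and reloads random payloads and reports every mismatch. This is only a sketch: `stress_round_trips` and its parameters are names invented here, and the `store` and `load` callables are placeholders for whatever code the application really uses to write and read a large object (they are not part of any PostgreSQL or psycopg API).

```python
import hashlib
import os
import random
from typing import Callable, Dict, Hashable, List


def stress_round_trips(
    store: Callable[[bytes], Hashable],
    load: Callable[[Hashable], bytes],
    tries: int = 1000,
    max_size: int = 1 << 20,
    seed: int = 0,
) -> List[Dict[str, object]]:
    """Store and reload random payloads; return one report per mismatch."""
    rng = random.Random(seed)  # seeded so the sequence of sizes is repeatable
    failures: List[Dict[str, object]] = []
    for attempt in range(tries):
        size = rng.randint(1, max_size)
        payload = os.urandom(size)
        key = store(payload)    # hypothetical: write the payload, return its id
        returned = load(key)    # hypothetical: read the payload back by id
        if returned != payload:
            failures.append({
                "attempt": attempt,
                "key": key,
                "sent_size": len(payload),
                "returned_size": len(returned),
                "sent_sha256": hashlib.sha256(payload).hexdigest(),
            })
    return failures
```

With an in-memory dictionary standing in for the database the harness should report no failures; swapping in the real store and load code, and running it under load, is what would show whether truncation can be reproduced outside the full application.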
On 6/2/07, Paolo Bizzarri <pibizza@gmail.com> wrote:
> What we are trying to understand is, first and foremost, if there are
> known cases under which PostgreSQL can truncate a file.

I think it's somewhat more likely that whatever is sending the file to PG is the cause, either in how it handles the file or due to communications issues.

This sounds similar to a problem I experienced with an application I wrote that takes files extracted from email (using MHonarc) and stores them in a PG database so that I can render them using a web browser later on. I wound up having to store the files in BASE64 encoding to keep them from getting corrupted.
--
Mike Nolan
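
The BASE64 workaround mentioned above can be sketched with the Python standard library. The function names are invented for this example, and the thread does not say how the encoded text was actually stored; the sketch only shows that arbitrary bytes survive a round trip through plain ASCII text.

```python
import base64


def encode_for_storage(raw: bytes) -> str:
    """Turn arbitrary bytes into ASCII text that is safe to pass as a string."""
    return base64.b64encode(raw).decode("ascii")


def decode_from_storage(text: str) -> bytes:
    """Recover the original bytes; raises an error on non-base64 characters."""
    return base64.b64decode(text, validate=True)
```

The trade-off is size: every 3 bytes of input become 4 characters of output, so stored data grows by about a third.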
I don't use lo_import and lo_export myself, but is there any way to log their usage? It certainly sounds as though step 1 for this user is to keep track of how much data is handed to PG for each file, and how much data is returned by PG for each file (and how much data is in the file at the time of the request).

--
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice
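
The bookkeeping suggested above could be done on the application side, independently of how the data reaches the database. The following is a minimal sketch with invented names (`fingerprint`, `TransferLog`); it records the size and a SHA-256 digest of each file when it is handed over, and compares them when the file comes back.

```python
import hashlib
from typing import Dict, Optional, Tuple


def fingerprint(data: bytes) -> Tuple[int, str]:
    """Return (size in bytes, SHA-256 hex digest) for a payload."""
    return len(data), hashlib.sha256(data).hexdigest()


class TransferLog:
    """Remembers what was handed to the database for each file id."""

    def __init__(self) -> None:
        self._sent: Dict[str, Tuple[int, str]] = {}

    def record_sent(self, file_id: str, data: bytes) -> None:
        self._sent[file_id] = fingerprint(data)

    def check_returned(self, file_id: str, data: bytes) -> Optional[str]:
        """Return None if the data matches what was sent, else a description."""
        sent_size, sent_hash = self._sent[file_id]
        got_size, got_hash = fingerprint(data)
        if got_size < sent_size:
            return f"{file_id}: truncated, sent {sent_size} bytes, got {got_size}"
        if got_size > sent_size:
            return f"{file_id}: grew, sent {sent_size} bytes, got {got_size}"
        if got_hash != sent_hash:
            return f"{file_id}: same size but different content"
        return None
```

Calling `record_sent` just before the write and `check_returned` just after each read would at least show whether the truncation happens before the data reaches the database or after it leaves it.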
What OS are you running? Linux (32 or 64 bit)? Ext3 filesystem? Which kernel version? A bug in Ext3/Linux kernel/hardware (RAID controller?)?

Does the error only happen under heavy load?

regards,
-Franz

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Paolo Bizzarri
Sent: Saturday, June 2, 2007 07:46
To: Purusothaman A
Cc: Richard Huxton; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Corruption of files in PostgreSQL
If there is any database driver that was built with the old PostgreSQL sources/libs, rebuild this driver with the new PostgreSQL sources/libs.

Greetings,
-Franz
Paolo Bizzarri wrote:
> On 6/2/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> "Paolo Bizzarri" <pibizza@gmail.com> writes:
>> > On 6/2/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> >> Please provide a reproducible test case ...
>>
>> > as explained above, the problem seems quite random. So I need to
>> > understand what we have to check.
>>
>> In this context "reproducible" means that the failure happens
>> eventually. I don't care if the test program only fails once in
>> thousands of tries --- I just want a complete self-contained example
>> that produces a failure.
>
> As said above, our application is rather complex and involves several
> different pieces of software, including Zope, OpenOffice both as
> server and client, and PostgreSQL. We are absolutely NOT sure that the
> problem is inside PostgreSQL.
>
> What we are trying to understand is, first and foremost, if there are
> known cases under which PostgreSQL can truncate a file.

I would suspect either your hardware (RAID controller, hard drive, cache etc) or your OS (kernel bug, file system bug, etc)

For instance:

http://lwn.net/Articles/215868/

documents a bug in the 2.6 linux kernel that can result in corrupted files if there are a lot of processes accessing it at once.
Hi Scott,

in fact, we were using a 2.6.12 kernel. Can this be a problem?

Best regards.
Paolo Bizzarri
On Tue, 5 Jun 2007, Paolo Bizzarri wrote:

> On 6/4/07, Scott Marlowe <smarlowe@g2switchworks.com> wrote:
>> http://lwn.net/Articles/215868/
>> documents a bug in the 2.6 linux kernel that can result in corrupted
>> files if there are a lot of processes accessing it at once.
>
> in fact, we were using a 2.6.12 kernel. Can this be a problem?

That particular problem appears to be specific to newer kernels so I wouldn't think it's related to your issue.

Tracking down random crashes of the sort you're reporting is hard. As Scott rightly suggested, the source of the problem could easily be any number of hardware components or low-level software like the kernel. The tests required to really certify that a server is suitable for production use can take several days' worth of testing.

The normal approach here would be to move this application+data to another system and see if the problem is still there; that lets you rule out all the hardware at once. That would do something else you should be thinking about: making absolutely sure you can back up and restore your data, and that the corruption you're seeing isn't causing information to be lost in your database.

The general flow of figuring out the cause for random problems goes something like this:

1) Check for memory errors. http://www.memtest86.com/ is a good tool for PCs. That will need to run for many hours.

2) Run the manufacturer's disk utilities to see if any of your disks are going bad. You might be able to do this using Linux's SMART tools instead without even taking the server down; if you're not using those already you should look into that. http://www.linuxjournal.com/article/6983 is a good intro here.

3) Boot another version of Linux and run some low-level disk tests there. A live CD/DVD like Knoppix or Ubuntu is the easiest way to do that.

4) If everything above passes, upgrade to the kernel version used on the live CD/DVD and see if the problem goes away.

You can try skipping right to #4 here and playing with the kernel first, but understand that if your underlying hardware has issues, that may cause more corruption (with possible data loss) rather than less.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith wrote: > On Tue, 5 Jun 2007, Paolo Bizzarri wrote: > >> On 6/4/07, Scott Marlowe <smarlowe@g2switchworks.com> wrote: >>> http://lwn.net/Articles/215868/ >>> documents a bug in the 2.6 linux kernel that can result in corrupted >>> files if there are a lot of processes accessing it at once. >> >> in fact, we were using a 2.6.12 kernel. Can this be a problem? > > That particular problem appears to be specific to newer kernels so I > wouldn't think it's related to your issue. That is not entirely correct. The problem was present all the way back to the 2.5 kernels, before the 2.6 kernels were released. However, there was an update to the 2.6.18/19 kernels that made this problem much more likely to bite. Many people running older 2.6 kernels reported data loss that mysteriously went away after updating to post-2.6.19 kernels (or, in the case of Red Hat, the updated 2.6.9-44 or so kernels, which backported the fix). So, it IS possible that it's the kernel, but not likely. I'm still betting on a bad RAID controller or something like that. But updating the kernel probably wouldn't be a bad idea.
On 6/5/07, Scott Marlowe <smarlowe@g2switchworks.com> wrote: > Greg Smith wrote: > > On Tue, 5 Jun 2007, Paolo Bizzarri wrote: > > > >> On 6/4/07, Scott Marlowe <smarlowe@g2switchworks.com> wrote: > >>> http://lwn.net/Articles/215868/ > >>> documents a bug in the 2.6 linux kernel that can result in corrupted > >>> files if there are a lot of processes accessing it at once. > >> > >> in fact, we were using a 2.6.12 kernel. Can this be a problem? > > > > That particular problem appears to be specific to newer kernels so I > > wouldn't think it's related to your issue. > > That is not entirely correct. The problem was present all the way back > to the 2.5 kernels, before the 2.6 kernels were released. However, > there was an update to the 2.6.18/19 kernels that made this problem much > more likely to bite. There were reports of data loss for many people > running on older 2.6 kernels that mysteriously went away after updating > to post 2.6.19 kernels (or in the case of redhat, the updated 2.6.9-44 > or so kernels, which backported the fix.) > I understand this. At the same time, the system was under quite heavy load, so it is possible that some peculiar, rather subtle bug was biting us. Many files were manipulated in exactly the same way, but only a very small number of them were truncated. I would like to rule out all known bugs. BTW, as our PostgreSQL was compiled from source, do you suggest recompiling the whole thing after upgrading the kernel? > So, it IS possible that it's the kernel, but not likely. I'm still > betting on a bad RAID controller or something like that. But updating > the kernel probably wouldn't be a bad idea. > The deployed configuration is quite large (two servers using a large shared SCSI-to-IDE disk array), and it would be quite difficult to replicate it on a different configuration. At the same time, the problems were visible only under heavy load, so using a simpler system would not really help. Ciao Paolo Bizzarri Icube S.r.l.
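Since only a handful of files are truncated and only under heavy load, one way to narrow things down is to record a fingerprint of each document at write time and compare it whenever the document is read back. The sketch below is only an assumption about how that could look; the helper names `fingerprint` and `check` are hypothetical, and nothing in this thread says PAFlow stores such a value. It uses plain Python and no database calls, so where the fingerprint is kept (for example in a column next to the large object's OID) is left open.

```python
import hashlib


def fingerprint(data: bytes) -> tuple:
    """Return (length, SHA-256 hex digest) for a document's bytes."""
    return len(data), hashlib.sha256(data).hexdigest()


def check(data: bytes, expected: tuple) -> str:
    """Compare bytes read back from storage against a saved fingerprint.

    Returns "ok", "truncated" (fewer bytes than were written) or
    "corrupted" (any other mismatch).
    """
    length, _digest = expected
    if len(data) < length:
        return "truncated"
    if fingerprint(data) != expected:
        return "corrupted"
    return "ok"


if __name__ == "__main__":
    original = b"example document contents" * 100
    saved = fingerprint(original)          # computed before writing
    print(check(original, saved))          # full read-back
    print(check(original[:1000], saved))   # simulated short read
```

Logging the result together with a timestamp on every read would show whether truncation happens at write time or appears later, which helps separate an application or driver problem from a storage problem.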
Upgrading to a new kernel would be a very good idea, but you will have to test the new kernel too. In my opinion, the reason for your problem is either the Linux kernel or your hardware. I guess you have to update your kernel because of security bugs too. Changing the kernel is probably easier than changing the hardware. Greetings, -Franz -----Original Message----- From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Paolo Bizzarri Sent: Wednesday, 6 June 2007 10:18 To: Scott Marlowe Cc: Greg Smith; pgsql-general@postgresql.org Subject: Re: [GENERAL] Corruption of files in PostgreSQL On 6/5/07, Scott Marlowe <smarlowe@g2switchworks.com> wrote: > Greg Smith wrote: > > On Tue, 5 Jun 2007, Paolo Bizzarri wrote: > > > >> On 6/4/07, Scott Marlowe <smarlowe@g2switchworks.com> wrote: > >>> http://lwn.net/Articles/215868/ > >>> documents a bug in the 2.6 linux kernel that can result in corrupted > >>> files if there are a lot of processes accessing it at once. > >> > >> in fact, we were using a 2.6.12 kernel. Can this be a problem? > > > > That particular problem appears to be specific to newer kernels so I > > wouldn't think it's related to your issue. > > That is not entirely correct. The problem was present all the way back > to the 2.5 kernels, before the 2.6 kernels were released. However, > there was an update to the 2.6.18/19 kernels that made this problem much > more likely to bite. There were reports of data loss for many people > running on older 2.6 kernels that mysteriously went away after updating > to post 2.6.19 kernels (or in the case of redhat, the updated 2.6.9-44 > or so kernels, which backported the fix.) > I understand this. At the same time, the system was under quite heavy load, so it is possible that some peculiar, rather subtle bug was biting us. There were many files manipulated all in the same way, but only some (really little of them) were truncated. 
I would like to remove all possible known cases of bugs. BTW, as ou Postgresql was recompiled from sources, do you suggest to recompile the whole after upgrading the kernel? > So, it IS possible that it's the kernel, but not likely. I'm still > betting on a bad RAID controller or something like that. But updating > the kernel probably wouldn't be a bad idea. > The deployed configuration is quite large (two servers using a shared SCSI-to-IDE large disk array), and it would be quite difficult to replicate a different configuration. At the same time, problems were visible only under heavy load, so using a simpler system would not really help. Ciao Paolo Bizzarri Icube S.r.l.