Thread: Lost rows/data corruption?
Hello,

We have a long-running DB application using PG 7.4.6. We do a VACUUM FULL every night and a normal 'maintenance' VACUUM every hour. We do nothing with any indexes. Every now and then we get errors from the database whereby an update on a table fails with a duplicate-key violation on the primary key. Theoretically this is impossible, as the constraint should prevent any duplication, ever. When the next vacuum is run, we see an error saying that there are more rows than the database can account for (or words to that effect) and that we should either REINDEX or TRUNCATE the table.

Firstly, how and why could this be happening, and above all, how do we prevent it?

Thanks and regards,

Andrew.
"Andrew Hall" <temp02@bluereef.com.au> writes: > We have a long running DB application using PG7.4.6. We do a VACUUM FULL > every night and a normal 'maintenance' VACUUM every hour. We do nothing with > any indexes. Every now and then we get errors from the database whereby an > update will fail on a table saying that there is duplicate violation on a > primary keyed row. Theoretically this is impossible as the constraint should > prevent any duplication, ever. When the next vacuum is run, we see an error > saying that there are more rows than the database can account for (or words > to that effect) and we should either REINDEX or TRUNCATE the table. Could we see the exact log trace, rather than handwaving? Also the schemas of the table(s)/index(es) involved might be useful. regards, tom lane
Tom,

Here is the data you requested. It took a little while to gather it, as this kind of corruption doesn't happen all the time.

The first sign that we know something is wrong is our application freezing when communicating with the DB. If we then issue a vacuum, here's the output:

sonar=# VACUUM FULL;
WARNING:  index "user_session_pkey" contains 85613 row versions, but table contains 85513 row versions
HINT:  Rebuild the index with REINDEX.
WARNING:  index "user_session_pkey" contains 85613 row versions, but table contains 85513 row versions
HINT:  Rebuild the index with REINDEX.
VACUUM

If we then try a reindex, here's the output:

sonar=# REINDEX TABLE user_session;
ERROR:  could not create unique index
DETAIL:  Table contains duplicated values.

There are duplicates and massive amounts of erroneous data in most columns in the table. From here, no amount of row deletion seems to clear the issue; the only remedy is a truncate on the table and a repopulate. In fact, if we try even a select we get this:

WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

Here is the schema for the table:

sonar=# \d user_session
            Table "public.user_session"
     Column     |            Type             |                     Modifiers
----------------+-----------------------------+----------------------------------------------------
 user_id        | integer                     | not null
 session_id     | integer                     | not null
 group_id       | integer                     |
 src            | character varying(22)       | not null
 hit_bytes_in   | bigint                      |
 miss_bytes_in  | bigint                      |
 logon_time     | timestamp without time zone | default ('now'::text)::timestamp(6) with time zone
 logoff_time    | timestamp without time zone | default ('now'::text)::timestamp(6) with time zone
 account_active | character(1)                | default 'Y'::bpchar
 hit_bytes_out  | bigint                      |
 miss_bytes_out | bigint                      |
 cost_bytes_in  | double precision            |
 cost_bytes_out | double precision            |
 time_cost      | double precision            |
Indexes:
    "user_session_pkey" primary key, btree (user_id, session_id)
Foreign-key constraints:
    "$1" FOREIGN KEY (user_id) REFERENCES user_table(user_id) ON DELETE CASCADE
Triggers:
    delsessionusagetrigger AFTER DELETE ON user_session FOR EACH ROW EXECUTE PROCEDURE delsessionusagefunc()

I'm most concerned that a primary key constraint is being actively violated, and I don't understand how this is possible. Any help appreciated.

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Andrew Hall" <temp02@bluereef.com.au>
Cc: <pgsql-general@postgresql.org>
Sent: Friday, February 04, 2005 10:12 AM
Subject: Re: [GENERAL] Lost rows/data corruption?

> "Andrew Hall" <temp02@bluereef.com.au> writes:
>> We have a long-running DB application using PG 7.4.6. We do a VACUUM FULL
>> every night and a normal 'maintenance' VACUUM every hour. We do nothing
>> with any indexes. Every now and then we get errors from the database
>> whereby an update on a table fails with a duplicate-key violation on the
>> primary key. Theoretically this is impossible, as the constraint should
>> prevent any duplication, ever. When the next vacuum is run, we see an
>> error saying that there are more rows than the database can account for
>> (or words to that effect) and we should either REINDEX or TRUNCATE the
>> table.
>
> Could we see the exact log trace, rather than handwaving?  Also the
> schemas of the table(s)/index(es) involved might be useful.
>
> 			regards, tom lane
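For anyone wanting to poke at a table in this state before truncating it, a query along these lines will list the duplicated key values together with each row version's physical location (ctid) and creating transaction (xmin). This is only a sketch against the schema above, not a fix, and on a table this badly damaged even a plain SELECT can crash the backend, as the trace above shows:

SELECT u.user_id, u.session_id, u.ctid, u.xmin
  FROM user_session u
  JOIN (SELECT user_id, session_id
          FROM user_session
         GROUP BY user_id, session_id
        HAVING count(*) > 1) dup   -- key pairs with more than one visible row
 USING (user_id, session_id)
 ORDER BY u.user_id, u.session_id;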
"Andrew Hall" <temp02@bluereef.com.au> writes: > Here is the data you requested. It took little while to gather it as this > kind of corruption doesn't happen all the time. It sounds like a mess, all right. Do you have a procedure to follow to replicate this havoc? Are you sure there's not a hardware problem underlying it all? regards, tom lane
>> It sounds like a mess, all right. Do you have a procedure to follow to
>> replicate this havoc? Are you sure there's not a hardware problem
>> underlying it all?
>>
>> 			regards, tom lane

We haven't been able to isolate what causes it, but it's unlikely to be hardware as it happens on quite a few of our customers' boxes. We also use XFS on Linux 2.6 as a file system, so the FS should be fairly tolerant of power outages. Any ideas as to how I might go about isolating this? Have you heard any other reports of this kind, and suggested remedies?
"Andrew Hall" <temp02@bluereef.com.au> writes: > We haven't been able to isolate what causes it but it's unlikely to be > hardware as it happens on quite a few of our customer's boxes. Okay, then not hardware; but it seems like you ought to be in a position to create a test case for other people to poke at. I don't insist on a 100% reproducible case, but something that will show the problem if run for awhile would be a great help. regards, tom lane
Tom Lane wrote:
> "Andrew Hall" <temp02@bluereef.com.au> writes:
>
>> We haven't been able to isolate what causes it, but it's unlikely to be
>> hardware as it happens on quite a few of our customers' boxes.
>
> Okay, then not hardware; but it seems like you ought to be in a position
> to create a test case for other people to poke at. I don't insist on
> a 100% reproducible case, but something that will show the problem if
> run for a while would be a great help.

His original statement prompts a question in my mind. I may be wrong here, but when he noted:

'We also use XFS on Linux 2.6 as a file system, so the FS should be fairly tolerant of power outages.'

Is Andrew indicating here that there might be some issues with power loss on some of these boxes? If so, is it reasonable to assume that the filesystem is able to maintain the database integrity in such a power loss? I understand that XFS is quite a robust file system, but I can't see relying on such robustness for database integrity (or any file integrity, for that matter). UPSes might be a better solution.

So the actual question in my mind that I didn't see anyone touch on is: is it safe to assume that a power outage will not affect the database integrity, based on the robustness of the file system type? Personally, I would not rely on such, but I'd like to hear what the Postgresql experts think about this issue.

Then again, I may have read too much into Andrew's post. Andrew, do you suspect there have been power issues with any of these machines? Are you comfortable relying on the filesystem to deal with such issues? Ideally, I would research any correlation between power outages and the database problems. If there are no power outages to speak of, then sorry for yapping up the wrong woody perennial plant. Really just fishing for some insights here, folks.

--
Until later, Geoffrey
On Tue, 2005-02-15 at 04:56, Geoffrey wrote:
> Tom Lane wrote:
>> "Andrew Hall" <temp02@bluereef.com.au> writes:
>>
>>> We haven't been able to isolate what causes it, but it's unlikely to be
>>> hardware as it happens on quite a few of our customers' boxes.
>>
>> Okay, then not hardware; but it seems like you ought to be in a position
>> to create a test case for other people to poke at. I don't insist on
>> a 100% reproducible case, but something that will show the problem if
>> run for a while would be a great help.
>
> His original statement prompts a question in my mind. I may be wrong
> here, but when he noted:
>
> 'We also use XFS on Linux 2.6 as a file system, so the FS should be
> fairly tolerant of power outages.'
>
> Is Andrew indicating here that there might be some issues with power
> loss on some of these boxes? If so, is it reasonable to assume that the
> filesystem is able to maintain the database integrity in such a power
> loss? I understand that XFS is quite a robust file system, but I can't
> see relying on such robustness for database integrity (or any file
> integrity, for that matter). UPSes might be a better solution.

If I were him I'd try running my database on a different file system to see if his version of XFS might be causing these problems. While I agree that frequent power loss is NOT something a database should be exposed to, a properly set up machine with a properly functioning journalling file system should not experience these problems. Might be time to check the drive subsystem to make sure it's properly fsyncing data.
On Tue, 15 Feb 2005, Andrew Hall wrote:

>>> It sounds like a mess, all right. Do you have a procedure to follow to
>>> replicate this havoc? Are you sure there's not a hardware problem
>>> underlying it all?
>>>
>>> 			regards, tom lane
>
> We haven't been able to isolate what causes it, but it's unlikely to be
> hardware as it happens on quite a few of our customers' boxes. We also
> use XFS on Linux 2.6 as a file system, so the FS should be fairly
> tolerant of power outages. Any ideas as to how I might go about isolating
> this? Have you heard any other reports of this kind, and suggested
> remedies?

Are you running with fsync = off? And did the hosts experience any power outage recently?

.TM.
--
 ____/  ____/   /
/      /       /          Marco Colombo
___/  ___  /   /          Technical Manager
/          /   /          ESI s.r.l.
_____/ _____/  _/         Colombo@ESI.it
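For reference, the current setting is easy to check from any psql session; the prompt and the value shown here are just illustrative:

sonar=# SHOW fsync;
 fsync
-------
 on
(1 row)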
fsync is on for all these boxes. Our customers run their own hardware, with many different specifications of hardware in use. Many of our customers don't have a UPS; their power is probably pretty reliable (normal city-based utilities), but of course I can't guarantee they don't get an outage once in a while with a thunderstorm etc.

The problem here is that we are consistently seeing the same kind of corruption and symptoms across a fairly large number of customers (52 have reported this problem), so there is something endemic happening here that, to be honest, I'm surprised no one else is seeing. Fundamentally there is nothing particularly abnormal about our application or data, but regardless, I would have thought these kinds of things (application design, data representation etc.) irrelevant to the database's ability to reject duplicate data on a primary key.

Something is causing this corruption, and one thing we do know is that it doesn't happen immediately with a new installation; it takes time (several months of usage) before we start to see this condition. I'd be really surprised if XFS is the problem, as I know there are plenty of other people across the world using it reliably with PG. We're going to see if we can build a test environment that can forcibly cause this, but I don't hold much hope, as we've tried to isolate it before with little success.

Here's what we tried changing when we originally went searching for the problem, and it's still here:

- the hardware (tried single CPU instead of dual, though that may be an issue with the OS)
- the OS version (tried Linux 2.6.5, 2.6.6, 2.6.7, 2.6.8.1, 2.6.10 and 2.4.22), all using XFS
- the database table layout (tried changing the way the data is stored)
- the version of Jetty (servlet engine)
- the DB pool manager and PG JDBC driver versions
- the version of PG (tried two or three back from the latest)
- various vacuum regimes

----- Original Message -----
From: "Marco Colombo" <pgsql@esiway.net>
To: "Andrew Hall" <temp02@bluereef.com.au>
Cc: <pgsql-general@postgresql.org>
Sent: Wednesday, February 16, 2005 2:58 AM
Subject: Re: Lost rows/data corruption?

> On Tue, 15 Feb 2005, Andrew Hall wrote:
>
>>>> It sounds like a mess, all right. Do you have a procedure to follow to
>>>> replicate this havoc? Are you sure there's not a hardware problem
>>>> underlying it all?
>>>>
>>>> 			regards, tom lane
>>
>> We haven't been able to isolate what causes it, but it's unlikely to be
>> hardware as it happens on quite a few of our customers' boxes. We also
>> use XFS on Linux 2.6 as a file system, so the FS should be fairly
>> tolerant of power outages. Any ideas as to how I might go about
>> isolating this? Have you heard any other reports of this kind, and
>> suggested remedies?
>
> Are you running with fsync = off? And did the hosts experience any
> power outage recently?
>
> .TM.
> --
>  ____/  ____/   /
> /      /       /          Marco Colombo
> ___/  ___  /   /          Technical Manager
> /          /   /          ESI s.r.l.
> _____/ _____/  _/         Colombo@ESI.it
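One cheap thing to bolt onto a vacuum regime like this is a post-VACUUM sanity probe: VACUUM refreshes pg_class.reltuples for a table and for its indexes, so the two values drifting apart is an early hint of the index/heap mismatch that the earlier WARNING reports. A sketch (reltuples is only an estimate, so treat small differences as noise):

SELECT relname, relkind, relpages, reltuples
  FROM pg_class
 WHERE relname IN ('user_session', 'user_session_pkey');
-- relkind 'r' = table, 'i' = index; compare the two reltuples values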
On Wed, 16 Feb 2005, Andrew Hall wrote:

> fsync is on for all these boxes. Our customers run their own hardware,
> with many different specifications of hardware in use. Many of our
> customers don't have a UPS; their power is probably pretty reliable
> (normal city-based utilities), but of course I can't guarantee they
> don't get an outage once in a while with a thunderstorm etc.

I see. Well, I can't help much, then; I don't run PG on XFS. I suggest testing on a different FS, to exclude XFS problems. But with fsync on, the FS has very little to do with reliability, unless it _lies_ about fsync(). Any FS should return from fsync only after data is on disc, journal or not (there might be issues with meta-data, but that's hardly a problem with PG).

It's more likely the hardware (IDE disks) lies about data being on the platter. But again, that's only in case of sudden poweroffs.

[...]

> this condition. I'd be really surprised if XFS is the problem, as I know
> there are plenty of other people across the world using it reliably with
> PG.

This is kind of OT, but I don't follow your logic here. I don't see why plenty of success stories of XFS+PG suggest to you that the culprit is PG. To me it's still 50% - 50%. :-)

Moreover, XFS is continuously updated (it follows the normal fast linux kernel release cycle, like any other linux FS), so it's hard to compare data points unless someone else is using _exactly_ the same versions as you do. For example, in the kernel changelog from 2.6.7 to 2.6.10 you can read:

"[XFS] Fix a race condition in the undo-delayed-write buffer routine."
"[XFS] Fix up memory allocators to be more resilient."
"[XFS] Fix a possible data loss issue after an unaligned unwritten extent write."
"[XFS] handle inode creating race"

(only a few of them)

Now, I don't have even the faintest idea whether any of those might have affected you, but still, the point is that the linux kernel changes a lot. And vendors tend to customize their kernels a lot, too. On the PostgreSQL side, releases are slowly paced, so it's easier.

Anyway, I agree your problem is weird, and that it must be something on the server side. No matter what you do on the client side (pool manager, JDBC driver, servlet engine), in no way should the DB get corrupted with duplicated primary keys.

I know this is a silly question, but when you write 'We do nothing with any indexes', do you mean indexes are never, _never_ touched (I mean explicitly, as in drop/create index), i.e. they are created at schema creation time and then left alone? Just to make sure...

.TM.
--
 ____/  ____/   /
/      /       /          Marco Colombo
___/  ___  /   /          Technical Manager
/          /   /          ESI s.r.l.
_____/ _____/  _/         Colombo@ESI.it
Marco Colombo wrote:
> On Wed, 16 Feb 2005, Andrew Hall wrote:
>
>> fsync is on for all these boxes. Our customers run their own hardware,
>> with many different specifications of hardware in use. Many of our
>> customers don't have a UPS; their power is probably pretty reliable
>> (normal city-based utilities), but of course I can't guarantee they
>> don't get an outage once in a while with a thunderstorm etc.
>
> I see. Well, I can't help much, then; I don't run PG on XFS. I suggest
> testing on a different FS, to exclude XFS problems. But with fsync on,
> the FS has very little to do with reliability, unless it _lies_ about
> fsync(). Any FS should return from fsync only after data is on disc,
> journal or not (there might be issues with meta-data, but that's hardly
> a problem with PG).
>
> It's more likely the hardware (IDE disks) lies about data being on the
> platter. But again, that's only in case of sudden poweroffs.

Do you happen to have the same type of disks in all these systems? That could point to a disk cache "problem" (e.g. the disks lying about having written data from the cache to disk).

Or do you use the same disk parameters on all these machines? Have you tried using the disks w/o write caching and/or in synchronous mode (contrary to "async")?

--
Alban Hertroys
MAG Productions

T: +31(0)53 4346874
F: +31(0)53 4346876
E: alban@magproductions.nl
W: http://www.magproductions.nl
On Wed, 2005-02-16 at 07:14, Alban Hertroys wrote:
> Marco Colombo wrote:
>> On Wed, 16 Feb 2005, Andrew Hall wrote:
>>
>>> fsync is on for all these boxes. Our customers run their own hardware,
>>> with many different specifications of hardware in use. Many of our
>>> customers don't have a UPS; their power is probably pretty reliable
>>> (normal city-based utilities), but of course I can't guarantee they
>>> don't get an outage once in a while with a thunderstorm etc.
>>
>> I see. Well, I can't help much, then; I don't run PG on XFS. I suggest
>> testing on a different FS, to exclude XFS problems. But with fsync on,
>> the FS has very little to do with reliability, unless it _lies_ about
>> fsync(). Any FS should return from fsync only after data is on disc,
>> journal or not (there might be issues with meta-data, but that's hardly
>> a problem with PG).
>>
>> It's more likely the hardware (IDE disks) lies about data being on the
>> platter. But again, that's only in case of sudden poweroffs.
>
> Do you happen to have the same type of disks in all these systems? That
> could point to a disk cache "problem" (e.g. the disks lying about having
> written data from the cache to disk).
>
> Or do you use the same disk parameters on all these machines? Have you
> tried using the disks w/o write caching and/or in synchronous mode
> (contrary to "async")?

I was wondering if this problem had ever shown up on a machine that HADN'T lost power abruptly. IFF the only machines that experience corruption have lost power beforehand at some point, then I would look towards the drives, controller, or file system, or somewhere in there.

I know there are write modes in ext3 that will allow corruption on power loss (I think it's writeback). I know little of XFS in a production environment, as I run ext3, warts and all.
On Wed, 16 Feb 2005, Scott Marlowe wrote:

> I know there are write modes in ext3 that will allow corruption on power
> loss (I think it's writeback). I know little of XFS in a production
> environment, as I run ext3, warts and all.

Yeah, but even in writeback mode, ext3 doesn't lie on fsync. No FS does. Since PG can't expect any data to be on disk _before_ fsync completes, it doesn't really make a difference. You can lose data in writeback mode _if_ the application is not fsync-ing it (XFS's only "mode" is similar to writeback). I'm not aware of any case in which the system can lie about fsync(), unless the hardware is lying in turn.

One question for the gurus: does PG use fsync() on dirty data pages when they are flushed to disk at checkpoint time? Does it fsync() the directory in case of file creation/deletion/rename?

.TM.
--
 ____/  ____/   /
/      /       /          Marco Colombo
___/  ___  /   /          Technical Manager
/          /   /          ESI s.r.l.
_____/ _____/  _/         Colombo@ESI.it
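As a footnote to the fsync discussion: which sync primitive PostgreSQL uses for the WAL is configurable and visible from SQL. The values available in 7.4 (fsync, fdatasync, open_sync, open_datasync) depend on the platform; the value shown below is only illustrative:

sonar=# SHOW wal_sync_method;
 wal_sync_method
-----------------
 fdatasync
(1 row)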
> I know this is a silly question, but when you write 'We do nothing with
> any indexes', do you mean indexes are never, _never_ touched (I mean
> explicitly, as in drop/create index), i.e. they are created at schema
> creation time and then left alone? Just to make sure...

Hi, and thanks for your feedback. Yes, we never touch them; they are implicitly created at schema creation time and we don't touch them after that.
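For readers following along, the 'implicit' index here is the unique btree that PostgreSQL builds automatically to back a PRIMARY KEY constraint; 7.4 even emits a NOTICE about it at CREATE TABLE time. From the schema posted earlier, a definition along these lines would have produced "user_session_pkey" without any explicit CREATE INDEX (columns elided for brevity):

CREATE TABLE user_session (
    user_id    integer NOT NULL,
    session_id integer NOT NULL,
    -- ... remaining columns as in the \d output above ...
    PRIMARY KEY (user_id, session_id)  -- implicitly creates unique btree "user_session_pkey"
);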
> Do you happen to have the same type of disks in all these systems? That
> could point to a disk cache "problem" (e.g. the disks lying about having
> written data from the cache to disk).
>
> Or do you use the same disk parameters on all these machines? Have you
> tried using the disks w/o write caching and/or in synchronous mode
> (contrary to "async")?

It's all pretty common stuff: quite a few customers use standard IDE (various flavours of controller/disk), some now use SATA (again, various brands), and the rest use SCSI. The kernel we use is the standard Linus-approved kernel with the inbuilt drivers as part of the kernel. We don't supply any non-default parameters to the disk controllers.

Thanks for your suggestion on write caching, I'll look into it; I'm also tempted to try a different journalling FS.
> I was wondering if this problem had ever shown up on a machine that
> HADN'T lost power abruptly. IFF the only machines that experience
> corruption have lost power beforehand at some point, then I would look
> towards the drives, controller, or file system, or somewhere in there.

I can't be sure. We have an automated maintenance process that reboots all our customers' machines every 10 days at 2am. Having said this, I'm now wondering if this may have something to do with the issue. This automated process issues a 'shutdown' to the database (and all other processes), waits 20 seconds and then issues a 'reboot' to the kernel. If the database was still processing, the active postmaster process may wait for the client to complete its query before allowing it to close, but I'm assuming that if this exceeds 20 seconds, init will forcibly kill (SIGKILL) the remaining process tree and reboot immediately. Could this cause corruption?
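If the maintenance script can still reach the database before the reboot, one belt-and-braces check is to wait until no backend is mid-query before the final shutdown. A sketch for 7.4 (this assumes stats_command_string is enabled in postgresql.conf; otherwise pg_stat_activity doesn't show query text):

SELECT count(*) AS busy_backends
  FROM pg_stat_activity
 WHERE current_query <> '<IDLE>';  -- 0 means every backend is idle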
On Thu, Feb 17, 2005 at 07:40:25PM +1100, Andrew Hall wrote:
>
> We have an automated maintenance process that reboots all our
> customers' machines every 10 days at 2am.

What's the purpose of doing this?  If it's necessary then the reboots
aren't really fixing anything.  Is whatever problem that prompted this
procedure being investigated so a permanent fix can be applied?

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/
"Andrew Hall" <temp02@bluereef.com.au> writes: > I can't be sure. We have an automated maintenance process that reboots all > our customers machines every 10 days at 2am. Why? Sounds like a decision made by someone who is used to Windows. I've never seen any variant of Unix that needed that. regards, tom lane
We do the maintenance reboot (and various other log cleanups etc.) as part of our normal maintenance practice. We don't really 'need' to do this, however we've traditionally found that operating systems perform better with an occasional reboot (cleanup of fragmented memory etc.).

----- Original Message -----
From: "Michael Fuhr" <mike@fuhr.org>
To: "Andrew Hall" <temp02@bluereef.com.au>
Cc: "Scott Marlowe" <smarlowe@g2switchworks.com>; "Alban Hertroys" <alban@magproductions.nl>; "Marco Colombo" <pgsql@esiway.net>; <pgsql-general@postgresql.org>
Sent: Friday, February 18, 2005 1:56 AM
Subject: Re: [GENERAL] Lost rows/data corruption?

> On Thu, Feb 17, 2005 at 07:40:25PM +1100, Andrew Hall wrote:
>>
>> We have an automated maintenance process that reboots all our
>> customers' machines every 10 days at 2am.
>
> What's the purpose of doing this? If it's necessary then the reboots
> aren't really fixing anything. Is whatever problem that prompted
> this procedure being investigated so a permanent fix can be applied?
>
> --
> Michael Fuhr
> http://www.fuhr.org/~mfuhr/
A unix box or a mainframe that needs rebooting is generally considered broken.

On Thu, 2005-02-17 at 17:11, Andrew Hall wrote:
> We do the maintenance reboot (and various other log cleanups etc.) as
> part of our normal maintenance practice. We don't really 'need' to do
> this, however we've traditionally found that operating systems perform
> better with an occasional reboot (cleanup of fragmented memory etc.).
>
> ----- Original Message -----
> From: "Michael Fuhr" <mike@fuhr.org>
> To: "Andrew Hall" <temp02@bluereef.com.au>
> Cc: "Scott Marlowe" <smarlowe@g2switchworks.com>; "Alban Hertroys"
> <alban@magproductions.nl>; "Marco Colombo" <pgsql@esiway.net>;
> <pgsql-general@postgresql.org>
> Sent: Friday, February 18, 2005 1:56 AM
> Subject: Re: [GENERAL] Lost rows/data corruption?
>
>> On Thu, Feb 17, 2005 at 07:40:25PM +1100, Andrew Hall wrote:
>>>
>>> We have an automated maintenance process that reboots all our
>>> customers' machines every 10 days at 2am.
>>
>> What's the purpose of doing this? If it's necessary then the reboots
>> aren't really fixing anything. Is whatever problem that prompted
>> this procedure being investigated so a permanent fix can be applied?
>>
>> --
>> Michael Fuhr
>> http://www.fuhr.org/~mfuhr/
Quoting Andrew Hall <temp02@bluereef.com.au>:

>> Do you happen to have the same type of disks in all these systems? That
>> could point to a disk cache "problem" (e.g. the disks lying about having
>> written data from the cache to disk).
>>
>> Or do you use the same disk parameters on all these machines? Have you
>> tried using the disks w/o write caching and/or in synchronous mode
>> (contrary to "async")?
>
> It's all pretty common stuff: quite a few customers use standard IDE
> (various flavours of controller/disk), some now use SATA (again, various
> brands), and the rest use SCSI. The kernel we use is the standard
> Linus-approved kernel with the inbuilt drivers as part of the kernel.
> We don't supply any non-default parameters to the disk controllers.
>
> Thanks for your suggestion on write caching, I'll look into it; I'm also
> tempted to try a different journalling FS.

I'm a little late on this thread, but in regards to the SATA support: 2.4.29, in my experience, is really the first kernel with decent SATA support (i.e. much better data throughput). I think that would correspond to 2.6.9 or .10, but even before you get into all that, I am curious to know what you mean by "standard Linus kernel". Do you not compile your own kernels for the hardware platform being used?

--
Keith C. Perry, MS E.E.
Director of Networks & Applications
VCSN, Inc.
http://vcsn.com

____________________________________
This email account is being hosted by:
VCSN, Inc : http://vcsn.com
Yes, we compile our own kernel based on the "standardised" stable release available at the time. Everything we need is compiled in. This is what I mean by a standard Linus-approved kernel release (as opposed to an AC/MM modified release etc.).

----- Original Message -----
From: "Keith C. Perry" <netadmin@vcsn.com>
To: "Andrew Hall" <temp02@bluereef.com.au>
Cc: "Alban Hertroys" <alban@magproductions.nl>; "Marco Colombo" <pgsql@esiway.net>; <pgsql-general@postgresql.org>
Sent: Saturday, February 26, 2005 6:02 AM
Subject: Re: [GENERAL] Lost rows/data corruption?

> Quoting Andrew Hall <temp02@bluereef.com.au>:
>
>>> Do you happen to have the same type of disks in all these systems?
>>> That could point to a disk cache "problem" (e.g. the disks lying about
>>> having written data from the cache to disk).
>>>
>>> Or do you use the same disk parameters on all these machines? Have you
>>> tried using the disks w/o write caching and/or in synchronous mode
>>> (contrary to "async")?
>>
>> It's all pretty common stuff: quite a few customers use standard IDE
>> (various flavours of controller/disk), some now use SATA (again,
>> various brands), and the rest use SCSI. The kernel we use is the
>> standard Linus-approved kernel with the inbuilt drivers as part of the
>> kernel. We don't supply any non-default parameters to the disk
>> controllers.
>>
>> Thanks for your suggestion on write caching, I'll look into it; I'm
>> also tempted to try a different journalling FS.
>
> I'm a little late on this thread, but in regards to the SATA support:
> 2.4.29, in my experience, is really the first kernel with decent SATA
> support (i.e. much better data throughput). I think that would
> correspond to 2.6.9 or .10, but even before you get into all that, I am
> curious to know what you mean by "standard Linus kernel". Do you not
> compile your own kernels for the hardware platform being used?
>
> --
> Keith C. Perry, MS E.E.
> Director of Networks & Applications
> VCSN, Inc.
> http://vcsn.com