Thread: Lost rows/data corruption?

Lost rows/data corruption?

From
"Andrew Hall"
Date:
Hello,

We have a long-running DB application using PG 7.4.6. We do a VACUUM FULL
every night and a normal 'maintenance' VACUUM every hour. We do nothing with
any indexes. Every now and then we get errors from the database whereby an
update on a table will fail, saying that there is a duplicate-key violation
on the primary key. Theoretically this is impossible, as the constraint
should prevent any duplication, ever. When the next vacuum is run, we see an
error saying that there are more rows than the database can account for (or
words to that effect) and that we should either REINDEX or TRUNCATE the
table.
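
For what it's worth, the two cron jobs boil down to nothing more exotic,
run via psql, than:

    VACUUM FULL;    -- nightly
    VACUUM;         -- hourly 'maintenance' pass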

Firstly, how and why could this be happening, and above all, how do we
prevent it?

Thanks and regards,

Andrew.


Re: Lost rows/data corruption?

From
Tom Lane
Date:
"Andrew Hall" <temp02@bluereef.com.au> writes:
> We have a long-running DB application using PG 7.4.6. We do a VACUUM FULL
> every night and a normal 'maintenance' VACUUM every hour. We do nothing with
> any indexes. Every now and then we get errors from the database whereby an
> update on a table will fail, saying that there is a duplicate-key violation
> on the primary key. Theoretically this is impossible, as the constraint
> should prevent any duplication, ever. When the next vacuum is run, we see an
> error saying that there are more rows than the database can account for (or
> words to that effect) and that we should either REINDEX or TRUNCATE the
> table.

Could we see the exact log trace, rather than handwaving?  Also the
schemas of the table(s)/index(es) involved might be useful.

            regards, tom lane

Re: Lost rows/data corruption?

From
"Andrew Hall"
Date:
Tom,

Here is the data you requested. It took a little while to gather, as this
kind of corruption doesn't happen all the time.

The first sign that something is wrong is our application freezing when
communicating with the DB.

If we then issue a vacuum, here's the output:

sonar=# VACUUM FULL;
WARNING:  index "user_session_pkey" contains 85613 row versions, but table
contains 85513 row versions
HINT:  Rebuild the index with REINDEX.
WARNING:  index "user_session_pkey" contains 85613 row versions, but table
contains 85513 row versions
HINT:  Rebuild the index with REINDEX.
VACUUM

If we then try a REINDEX, here's the output:

sonar=# REINDEX TABLE user_session;
ERROR:  could not create unique index
DETAIL:  Table contains duplicated values.
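
When the table is still readable at all, the duplicate keys show up with a
grouping query along these lines (it's a sequential scan, so it doesn't
depend on the broken index):

sonar=# SELECT user_id, session_id, count(*)
sonar-# FROM user_session
sonar-# GROUP BY user_id, session_id
sonar-# HAVING count(*) > 1;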

There are duplicates, and massive amounts of erroneous data in most columns
of the table. From here, no amount of row deletion seems to clear the issue;
the only remedy is to truncate the table and repopulate. In fact, if we try
even a select we get this:

WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

Here is the schema for the table:

sonar=# \d user_session;
                               Table "public.user_session"
     Column     |            Type             |                     Modifiers
----------------+-----------------------------+----------------------------------------------------
 user_id        | integer                     | not null
 session_id     | integer                     | not null
 group_id       | integer                     |
 src            | character varying(22)       | not null
 hit_bytes_in   | bigint                      |
 miss_bytes_in  | bigint                      |
 logon_time     | timestamp without time zone | default ('now'::text)::timestamp(6) with time zone
 logoff_time    | timestamp without time zone | default ('now'::text)::timestamp(6) with time zone
 account_active | character(1)                | default 'Y'::bpchar
 hit_bytes_out  | bigint                      |
 miss_bytes_out | bigint                      |
 cost_bytes_in  | double precision            |
 cost_bytes_out | double precision            |
 time_cost      | double precision            |
Indexes:
    "user_session_pkey" primary key, btree (user_id, session_id)
Foreign-key constraints:
    "$1" FOREIGN KEY (user_id) REFERENCES user_table(user_id) ON DELETE CASCADE
Triggers:
    delsessionusagetrigger AFTER DELETE ON user_session FOR EACH ROW EXECUTE PROCEDURE delsessionusagefunc()

I'm most concerned that a primary key constraint is being actively violated,
and I don't understand how this is possible.

Any help appreciated.

Re: Lost rows/data corruption?

From
Tom Lane
Date:
"Andrew Hall" <temp02@bluereef.com.au> writes:
> Here is the data you requested. It took a little while to gather, as this
> kind of corruption doesn't happen all the time.

It sounds like a mess, all right.  Do you have a procedure to follow to
replicate this havoc?  Are you sure there's not a hardware problem
underlying it all?

            regards, tom lane

Re: Lost rows/data corruption?

From
"Andrew Hall"
Date:

>> It sounds like a mess, all right.  Do you have a procedure to follow to
>> replicate this havoc?  Are you sure there's not a hardware problem
>> underlying it all?
>>
>> regards, tom lane
>>

We haven't been able to isolate what causes it but it's unlikely to be
hardware as it happens on quite a few of our customers' boxes. We also use
XFS on linux 2.6 as a file system, so the FS should be fairly tolerant to
power-outages. Any ideas as to how I might go about isolating this? Have you
heard any other reports of this kind and suggested remedies?


Re: Lost rows/data corruption?

From
Tom Lane
Date:
"Andrew Hall" <temp02@bluereef.com.au> writes:
> We haven't been able to isolate what causes it but it's unlikely to be
> hardware as it happens on quite a few of our customers' boxes.

Okay, then not hardware; but it seems like you ought to be in a position
to create a test case for other people to poke at.  I don't insist on
a 100% reproducible case, but something that will show the problem if
run for awhile would be a great help.

            regards, tom lane

Re: Lost rows/data corruption?

From
Geoffrey
Date:
Tom Lane wrote:
> "Andrew Hall" <temp02@bluereef.com.au> writes:
>
>> We haven't been able to isolate what causes it but it's unlikely to be
>> hardware as it happens on quite a few of our customers' boxes.
>
>
> Okay, then not hardware; but it seems like you ought to be in a position
> to create a test case for other people to poke at.  I don't insist on
> a 100% reproducible case, but something that will show the problem if
> run for awhile would be a great help.

His original statement prompts a question in my mind.  I may be wrong
here, but when he noted:

'We also use XFS on linux 2.6 as a file system, so the FS should be
fairly tolerant to power-outages.'

Is Andrew indicating here that there might be some issues with power
loss on some of these boxes?  If so, is it reasonable to assume that the
filesystem is able to maintain the database integrity in such a power
loss?  I understand that XFS is quite a robust file system, but I can't
see relying on such robustness for database integrity (or any file
integrity for that matter).  UPS's might be a better solution.

So the actual question in my mind, which I didn't see anyone touch on, is:
is it safe to assume that a power outage will not affect database integrity,
based on the robustness of the file system type?

Personally, I would not rely on such, but I'd like to hear what the
PostgreSQL experts think about this issue.

Then again, I may have read too much into Andrew's post.  Andrew, do you
assume there have been power issues with any of these machines?  Are you
comfortable relying on the filesystem to deal with such issues?

Ideally, I would research any correlation between power outages and the
database problems.  If there are no power outages to speak of, then
sorry for yapping up the wrong woody perennial plant.

Really just fishing for some insights here folks.

--
Until later, Geoffrey

Re: Lost rows/data corruption?

From
Scott Marlowe
Date:
On Tue, 2005-02-15 at 04:56, Geoffrey wrote:
> Tom Lane wrote:
> > "Andrew Hall" <temp02@bluereef.com.au> writes:
> >
> >> We haven't been able to isolate what causes it but it's unlikely to be
> >> hardware as it happens on quite a few of our customers' boxes.
> >
> >
> > Okay, then not hardware; but it seems like you ought to be in a position
> > to create a test case for other people to poke at.  I don't insist on
> > a 100% reproducible case, but something that will show the problem if
> > run for awhile would be a great help.
>
> His original statement prompts a question in my mind.  I may be wrong
> here, but when he noted:
>
> 'We also use XFS on linux 2.6 as a file system, so the FS should be
> fairly tolerant to power-outages.'
>
> Is Andrew indicating here that there might be some issues with power
> loss on some of these boxes?  If so, is it reasonable to assume that the
> filesystem is able to maintain the database integrity in such a power
> loss?  I understand that XFS is quite a robust file system, but I can't
> see relying on such robustness for database integrity (or any file
> integrity for that matter).  UPS's might be a better solution.

If I were him, I'd try running the database on a different file system to
see if his version of XFS might be causing these problems.

While I agree that frequent power loss is NOT something a database
should be exposed to, a properly set up machine with a properly
functioning journalling file system should not experience these
problems.  Might be time to check the drive subsystem to make sure it's
properly fsyncing data.
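
A quick sanity check you can do from psql (a rough sketch; the numbers are
ballpark): with fsync on, each single-row INSERT outside a transaction
block is its own commit and has to hit the platter, so a 7200 RPM drive
shouldn't be able to complete much more than ~100 of them per second:

    \timing
    CREATE TABLE fsync_probe (i int);
    INSERT INTO fsync_probe VALUES (1);
    INSERT INTO fsync_probe VALUES (2);
    -- ...repeat; if each INSERT consistently returns in well under
    -- a millisecond, something in the stack is acknowledging writes
    -- it hasn't actually done.
    DROP TABLE fsync_probe;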

Re: Lost rows/data corruption?

From
Marco Colombo
Date:
On Tue, 15 Feb 2005, Andrew Hall wrote:

>
>
>>> It sounds like a mess, all right.  Do you have a procedure to follow to
>>> replicate this havoc?  Are you sure there's not a hardware problem
>>> underlying it all?
>>>
>>> regards, tom lane
>>>
>
> We haven't been able to isolate what causes it but it's unlikely to be
> hardware as it happens on quite a few of our customers' boxes. We also use
> XFS on linux 2.6 as a file system, so the FS should be fairly tolerant to
> power-outages. Any ideas as to how I might go about isolating this? Have you
> heard any other reports of this kind and suggested remedies?

Are you running with fsync = off? And did the hosts experience any
power outage recently?
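
(A plain "SHOW fsync;" from psql will tell you.)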

.TM.
--
       ____/  ____/   /
      /      /       /            Marco Colombo
     ___/  ___  /   /              Technical Manager
    /          /   /             ESI s.r.l.
  _____/ _____/  _/               Colombo@ESI.it

Re: Lost rows/data corruption?

From
"Andrew Hall"
Date:
fsync is on for all these boxes. Our customers run their own hardware with
many different specifications of hardware in use. Many of our customers don't
have UPS, although their power is probably pretty reliable (normal city
based utilities), but of course I can't guarantee they don't get an outage
once in a while with a thunderstorm etc.

The problem here is that we are consistently seeing the same kind of
corruption and symptoms across a fairly large number of customers (52 have
reported this problem), so there is something endemic happening here that,
to be honest, I'm surprised no one else is seeing. Fundamentally there is
nothing particularly abnormal about our application or data, but regardless,
I would have thought these kinds of things (application design, data
representation, etc.) irrelevant to the database's ability to reject
duplicate data on a primary key. Something is causing this corruption, and
one thing we do know is that it doesn't happen immediately with a new
installation; it takes time (several months of usage) before we start to see
this condition. I'd be really surprised if XFS is the problem, as I know
there are plenty of other people across the world using it reliably with PG.

We're going to see if we can build a test environment that can forcibly
cause this, but I don't hold out much hope, as we've tried to isolate it
before with little success. Here's what we tried changing when we originally
went searching for the problem, and it's still here:

- the hardware (tried single CPU instead of dual - though that may be an
issue with the OS)
- the OS version (tried Linux 2.6.5, 2.6.6, 2.6.7, 2.6.8.1, 2.6.10 and
2.4.22) - all using XFS
- the database table layout (tried changing the way the data is stored)
- the version of Jetty (servlet engine)
- the DB pool manager and PG JDBC driver versions
- the version of PG (tried two or three back from the latest)
- various vacuum regimes



Re: Lost rows/data corruption?

From
Marco Colombo
Date:
On Wed, 16 Feb 2005, Andrew Hall wrote:

> fsync is on for all these boxes. Our customers run their own hardware with
> many different specifications of hardware in use. Many of our customers don't
> have UPS, although their power is probably pretty reliable (normal city based
> utilities), but of course I can't guarantee they don't get an outage once in
> a while with a thunderstorm etc.

I see. Well I can't help much, then, I don't run PG on XFS. I suggest testing
on a different FS, to exclude XFS problems. But with fsync on, the FS has
very little to do with reliability, unless it _lies_ about fsync(). Any
FS should return from fsync only after data is on disc, journal or not
(there might be issues with meta-data, but it's hardly a problem with PG).

It's more likely the hardware (IDE disks) lies about data being on the
platter. But again that's only in case of sudden poweroffs.

[...]
> this condition. I'd be really surprised if XFS is the problem, as I know there
> are plenty of other people across the world using it reliably with PG.

This is kind of OT, but I don't follow your logic here.

I don't see why plenty of success stories about XFS+PG suggest to you that
the culprit is PG. To me it's still 50% - 50%. :-)

Moreover, XFS is continuously updated (it follows the normal fast Linux
kernel release cycle, like any other Linux FS), so it's hard to draw
conclusions unless someone else is using _exactly_ the same versions as
you do.

For example, in kernel changelog from 2.6.7 to 2.6.10 you can read:

"[XFS] Fix a race condition in the undo-delayed-write buffer routine."

"[XFS] Fix up memory allocators to be more resilient."

"[XFS] Fix a possible data loss issue after an unaligned unwritten
  extent write."

"[XFS] handle inode creating race"

(only a few of them)

Now, I don't have even the faintest idea whether that might have affected
you or not, but the point remains that the Linux kernel changes a lot.
And vendors tend to customize their kernels a lot, too. On the PostgreSQL
side, releases are more slowly paced, so it's easier.


Anyway, I agree that your problem is weird, and that it must be something
on the server side. No matter what you do on the client side (pool manager,
JDBC driver, servlet engine), in no way should the DB get corrupted with
duplicate primary keys.

I know this is a silly question, but when you write 'We do nothing with
any indexes', do you mean the indexes are never, _never_ touched (I mean
explicitly, as in drop/create index), i.e. they are created at schema
creation time and then left alone? Just to make sure...

.TM.
--
       ____/  ____/   /
      /      /       /            Marco Colombo
     ___/  ___  /   /              Technical Manager
    /          /   /             ESI s.r.l.
  _____/ _____/  _/               Colombo@ESI.it

Re: Lost rows/data corruption?

From
Alban Hertroys
Date:
Marco Colombo wrote:
> On Wed, 16 Feb 2005, Andrew Hall wrote:
>
>> fsync is on for all these boxes. Our customers run their own hardware
>> with many different specifications of hardware in use. Many of our
>> customers don't have UPS, although their power is probably pretty
>> reliable (normal city based utilities), but of course I can't
>> guarantee they don't get an outage once in a while with a thunderstorm
>> etc.
>
>
> I see. Well I can't help much, then, I don't run PG on XFS. I suggest
> testing
> on a different FS, to exclude XFS problems. But with fsync on, the FS has
> very little to do with reliability, unless it _lies_ about fsync(). Any
> FS should return from fsync only after data is on disc, journal or not
> (there might be issues with meta-data, but it's hardly a problem with PG).
>
> > It's more likely the hardware (IDE disks) lies about data being on the
> > platter. But again that's only in case of sudden poweroffs.
> But again that's only in case of sudden poweroffs.

Do you happen to have the same type of disks in all these systems? That
could point to a disk cache "problem" (e.g. the disks lying about having
written data from the cache to the platter).

Or do you use the same disk parameters on all these machines? Have you
tried using the disks without write caching, and/or in synchronous mode
(as opposed to "async")?
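
(On Linux IDE disks, turning the write cache off would be something like
"hdparm -W 0 /dev/hda"; I don't know offhand what the equivalent knob is
for your SATA or SCSI controllers.)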

--
Alban Hertroys
MAG Productions

T: +31(0)53 4346874
F: +31(0)53 4346876
E: alban@magproductions.nl
W: http://www.magproductions.nl

Re: Lost rows/data corruption?

From
Scott Marlowe
Date:
On Wed, 2005-02-16 at 07:14, Alban Hertroys wrote:
> Marco Colombo wrote:
> > On Wed, 16 Feb 2005, Andrew Hall wrote:
> >
> >> fsync is on for all these boxes. Our customers run their own hardware
> >> with many different specifications of hardware in use. Many of our
> >> customers don't have UPS, although their power is probably pretty
> >> reliable (normal city based utilities), but of course I can't
> >> guarantee they don't get an outage once in a while with a thunderstorm
> >> etc.
> >
> >
> > I see. Well I can't help much, then, I don't run PG on XFS. I suggest
> > testing
> > on a different FS, to exclude XFS problems. But with fsync on, the FS has
> > very little to do with reliability, unless it _lies_ about fsync(). Any
> > FS should return from fsync only after data is on disc, journal or not
> > (there might be issues with meta-data, but it's hardly a problem with PG).
> >
> > It's more likely the hardware (IDE disks) lies about data being on the
> > platter. But again that's only in case of sudden poweroffs.
> > But again that's only in case of sudden poweroffs.
>
> Do you happen to have the same type of disks in all these systems? That
> could point to a disk cache "problem" (e.g. the disks lying about having
> written data from the cache to the platter).
>
> Or do you use the same disk parameters on all these machines? Have you
> tried using the disks without write caching, and/or in synchronous mode
> (as opposed to "async")?

I was wondering whether this problem has ever shown up on a machine that
HADN'T lost power abruptly.  IFF the only machines that experience
corruption have lost power beforehand at some point, then I would look
towards the drives, the controller, the file system, or somewhere in
there.

I know there are write modes in ext3 that will allow corruption on power
loss (I think it's writeback).  I know little of XFS in a production
environment, as I run ext3, warts and all.

Re: Lost rows/data corruption?

From
Marco Colombo
Date:
On Wed, 16 Feb 2005, Scott Marlowe wrote:

> I know there are write modes in ext3 that will allow corruption on power
> loss (I think it's writeback).  I know little of XFS in a production
> environment, as I run ext3, warts and all.

Yeah, but even in writeback mode, ext3 doesn't lie on fsync. No FS does.

Since PG can't expect any data to be on disk _before_ fsync completes,
it doesn't really make a difference. You can lose data in writeback mode
_if_ the application is not fsync-ing it (XFS's only "mode" is similar to
writeback). I'm not aware of any case in which the system can lie about
fsync(), unless the hardware is lying in turn.

One question for gurus: does PG use fsync() on dirty data pages when
they are flushed to disk at checkpoint time? Does it fsync() the
directory in case of file creation/deletion/rename?

.TM.
--
       ____/  ____/   /
      /      /       /            Marco Colombo
     ___/  ___  /   /              Technical Manager
    /          /   /             ESI s.r.l.
  _____/ _____/  _/               Colombo@ESI.it

Re: Lost rows/data corruption?

From
"Andrew Hall"
Date:
> I know this is a silly question, but when you write 'We do nothing with
> any indexes', do you mean the indexes are never, _never_ touched (I mean
> explicitly, as in drop/create index), i.e. they are created at schema
> creation time and then left alone? Just to make sure...

Hi and thanks for your feedback,

Yes, we never touch them; as in, they are implicitly created at schema
creation time and we don't touch them after that.
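
In other words, the only index on the table is the implicit one from the
PRIMARY KEY clause when the schema is first loaded, roughly:

    CREATE TABLE user_session (
        user_id    integer NOT NULL,
        session_id integer NOT NULL,
        -- ...remaining columns as in the \d output I posted earlier...
        PRIMARY KEY (user_id, session_id)
    );
    NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
    "user_session_pkey" for table "user_session"

and nothing ever drops, creates or reindexes it after that.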


Re: Lost rows/data corruption?

From
"Andrew Hall"
Date:
> Do you happen to have the same type of disks in all these systems? That
> could point to a disk cache "problem" (e.g. the disks lying about having
> written data from the cache to the platter).
>
> Or do you use the same disk parameters on all these machines? Have you
> tried using the disks without write caching, and/or in synchronous mode
> (as opposed to "async")?

It's all pretty common stuff: quite a few customers use standard IDE
(various flavours of controller/disk), some now use SATA (again, various
brands) and the rest use SCSI. The kernel we use is the standard
Linus-approved kernel with the drivers built in. We don't supply any
non-default parameters to the disk controllers.

Thanks for your suggestion on write caching; I'll look into it. I'm also
tempted to try a different journalling FS.


Re: Lost rows/data corruption?

From
"Andrew Hall"
Date:
> I was wondering whether this problem has ever shown up on a machine that
> HADN'T lost power abruptly.  IFF the only machines that experience
> corruption have lost power beforehand at some point, then I would look
> towards the drives, the controller, the file system, or somewhere in
> there.

I can't be sure. We have an automated maintenance process that reboots all
our customers' machines every 10 days at 2am. Having said this, I'm now
wondering if this may have something to do with the issue. This automated
process issues a 'shutdown' to the database (and all other processes), waits
20 seconds and then issues a 'reboot' to the kernel. If the database is
still processing, the active postmaster process may wait for the client to
complete the query before allowing it to close, but I'm assuming that if
this exceeds 20 seconds, the kernel will issue a SIGQUIT to the process
tree and reboot immediately. Could this cause corruption?


Re: Lost rows/data corruption?

From
Michael Fuhr
Date:
On Thu, Feb 17, 2005 at 07:40:25PM +1100, Andrew Hall wrote:
>
> We have an automated maintenance process that reboots all our
> customers' machines every 10 days at 2am.

What's the purpose of doing this?  If it's necessary then the reboots
aren't really fixing anything.  Is whatever problem that prompted
this procedure being investigated so a permanent fix can be applied?

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/

Re: Lost rows/data corruption?

From
Tom Lane
Date:
"Andrew Hall" <temp02@bluereef.com.au> writes:
> I can't be sure. We have an automated maintenance process that reboots all
> our customers' machines every 10 days at 2am.

Why?  Sounds like a decision made by someone who is used to Windows.
I've never seen any variant of Unix that needed that.

            regards, tom lane

Re: Lost rows/data corruption?

From
"Andrew Hall"
Date:
We do the maintenance reboot (and various other log cleanups, etc.) as part
of our normal maintenance practice. We don't really 'need' to do this;
however, we've traditionally found that operating systems perform better
with an occasional reboot (cleaning up fragmented memory, etc.).



Re: Lost rows/data corruption?

From
Scott Marlowe
Date:
A unix box or a mainframe that needs rebooting is generally considered
broken.

On Thu, 2005-02-17 at 17:11, Andrew Hall wrote:
> We do the maintenance reboot (and various other log cleanups, etc.) as part
> of our normal maintenance practice. We don't really 'need' to do this;
> however, we've traditionally found that operating systems perform better
> with an occasional reboot (cleaning up fragmented memory, etc.).

Re: Lost rows/data corruption?

From
"Keith C. Perry"
Date:
Quoting Andrew Hall <temp02@bluereef.com.au>:

> > Do you happen to have the same type of disks in all these systems? That
> > could point to a disk cache "problem" (e.g. the disks lying about having
> > written data from the cache to the platter).
> >
> > Or do you use the same disk parameters on all these machines? Have you
> > tried using the disks without write caching, and/or in synchronous mode
> > (as opposed to "async")?
>
> It's all pretty common stuff: quite a few customers use standard IDE
> (various flavours of controller/disk), some now use SATA (again, various
> brands) and the rest use SCSI. The kernel we use is the standard
> Linus-approved kernel with the drivers built in. We don't supply any
> non-default parameters to the disk controllers.
>
> Thanks for your suggestion on write caching; I'll look into it. I'm also
> tempted to try a different journalling FS.

I'm a little late on this thread, but in regards to the SATA support: 2.4.29
is, in my experience, really the first kernel that has decent SATA support
(i.e. much better data throughput). I think that would correspond to 2.6.9
or .10. But even before you get into all that, I am curious to know what you
mean by "standard Linus kernel". Do you not compile your own kernels for the
hardware platform being used?

--
Keith C. Perry, MS E.E.
Director of Networks & Applications
VCSN, Inc.
http://vcsn.com

____________________________________
This email account is being hosted by:
VCSN, Inc : http://vcsn.com

Re: Lost rows/data corruption?

From
"Andrew Hall"
Date:
Yes, we compile our own kernel based on the "standardised" stable release
available at the time. Everything we need is compiled in. This is what I
mean by a standard Linus-approved kernel release (as opposed to an -ac or
-mm modified release, etc.).
