Thread: More along the lines of "WEIRD CRASH?!?!"?

More along the lines of "WEIRD CRASH?!?!"?

From
Murthy Kambhampaty
Date:
While putting a test server (dual PIII 1GHz, 512M ram, ide single disk for
system and  $PGDATA) under high load, I keep running into:

Mar 18 12:33:19 c257lx postgres[25277]: [6] NOTICE:  ALTER TABLE / ADD
PRIMARY KEY will create implicit index 'test_pkey' for table 'test'
Mar 18 12:33:22 c257lx postgres[25273]: [22] LOG:  recycled transaction log
file 0000000F00000068
Mar 18 12:33:45 c257lx postgres[25277]: [7] PANIC:  open of
/home/db/pgsql/data/pg_clog/0406 failed: No such file o
r directory
Mar 18 12:33:46 c257lx postgres[15667]: [1] LOG:  server process (pid 25277)
was terminated by signal 6

This always happens several passes into a loop (second or third with default
shared memory settings, much later with with 4096 shared buffers and 32
checkpoint segments). The workload in question goes like this: for each file
in a zip archive: unzip $file | copy to test from stdin; index
'dry_quad_hold'; insert into table1 from select ... where ... (inner joins
to "denormalize" the data) <3 of these in parallel subshells> ; add indexes
to the "sub-tables" in 3 parallel subshells; loop over roughly 100 text
files of 340,000 rows and 80 columns each. (It is designed to test my "LVM
snapshots" script.)

The workload being tested is handled without problem on production servers
running 7.2.3 -- quad (dual) PIII Xeon 500 MHz w, 6 GB (2 GB) RAM, SCSI
(IDE; "3ware") RAID5 decicated partition for $PGDATA.

Searching for <str>postgresql "terminated by signal 6"</str> led me to this
thread:
http://archives.postgresql.org/pgsql-performance/2003-01/msg00322.php
Is this a 7.3 regression? ... a memory size thing? I'm installing 7.2.4 on
the test machine and will re-run the same script there.

Thanks for help/suggestions,
    Murthy

Re: More along the lines of "WEIRD CRASH?!?!"?

From
Tom Lane
Date:
Murthy Kambhampaty <murthy.kambhampaty@goeci.com> writes:
> Mar 18 12:33:45 c257lx postgres[25277]: [7] PANIC:  open of
> /home/db/pgsql/data/pg_clog/0406 failed: No such file o
> r directory

> This always happens several passes into a loop (second or third with default
> shared memory settings, much later with with 4096 shared buffers and 32
> checkpoint segments).

Is it always complaining of the same pg_clog file name?  Are you getting
rid of the corruption between tries (eg with initdb), or is it just that
you keep hitting the same pre-existing corrupted data?

            regards, tom lane

Re: More along the lines of "WEIRD CRASH?!?!"?

From
Murthy Kambhampaty
Date:
Tom Lane [mailto:tgl@sss.pgh.pa.us] writes:
Murthy Kambhampaty <murthy.kambhampaty@goeci.com> writes:
>> Mar 18 12:33:45 c257lx postgres[25277]: [7] PANIC:  open of
>> /home/db/pgsql/data/pg_clog/0406 failed: No such file o
>> r directory
>
>> This always happens several passes into a loop (second or third with
default
>> shared memory settings, much later with 4096 shared buffers and 32
>> checkpoint segments).
>
>Is it always complaining of the same pg_clog file name?  Are you getting
>rid of the corruption between tries (eg with initdb), or is it just that
>you keep hitting the same pre-existing corrupted data?

I am not running initdb between tries to get rid of the corruption (I am
dropping the tables that are created). However, the filename is not always
the same (the repeated ones are almost certainly cases where I forgot to
drop the tables between iterations):
# less /var/log/postgresql | grep clog
Mar 17 13:42:11 c257lx postgres[30988]: [7] PANIC:  open of
/home/db/pgsql/data/pg_clog/0001 failed: No such file or directory
Mar 17 13:47:02 c257lx postgres[32111]: [10] PANIC:  open of
/home/db/pgsql/data/pg_clog/0001 failed: No such file or directory
Mar 17 20:33:58 c257lx postgres[6483]: [7] PANIC:  open of
/home/db/pgsql/data/pg_clog/0B02 failed: No such file or directory
Mar 17 23:52:12 c257lx postgres[8161]: [10] PANIC:  open of
/home/db/pgsql/data/pg_clog/0B02 failed: No such file or directory
Mar 18 04:03:06 c257lx postgres[8628]: [13] PANIC:  read of clog file 0,
offset 147456 failed: Success
Mar 18 09:25:28 c257lx postgres[10341]: [15] PANIC:  read of clog file 0,
offset 196608 failed: Success
Mar 18 12:33:45 c257lx postgres[25277]: [7] PANIC:  open of
/home/db/pgsql/data/pg_clog/0406 failed: No such file or directory
Mar 18 13:12:56 c257lx postgres[25398]: [69] PANIC:  open of
/home/db/pgsql/data/pg_clog/0730 failed: No such file or directory

I can test with an initdb between iterations, if that would help. (I am in
the process of setting up 7.2.4.)

Thanks,
    Murthy

Re: More along the lines of "WEIRD CRASH?!?!"?

From
Tom Lane
Date:
Murthy Kambhampaty <murthy.kambhampaty@goeci.com> writes:
> I am not running initdb between tries to get rid of the corruption (I am
> dropping the tables that are created). However, the filename is not always
> the same (the repeated ones are almost certainly cases where I forgot to
> drop the tables between iterations):

Hm.  This looks like ongoing data corruption to me --- the 'pg_clog'
failure is often the first visible symptom of a trashed tuple header
(or completely trashed page --- which is a case that we should detect
earlier, but don't at the moment).

Have you run memory and disk diagnostics recently on this machine?
I think it's dropping data every so often.

            regards, tom lane

Re: More along the lines of "WEIRD CRASH?!?!"?

From
Murthy Kambhampaty
Date:
Tom Lane [mailto:tgl@sss.pgh.pa.us] writes:
...
>
>Have you run memory and disk diagnostics recently on this machine?
>I think it's dropping data every so often.
>
Just the bios memory test at startup. I'll see what I can find on the disk
diagnostics (its a recent IDE disk, so I guess anything's possible there).
So much for testing LVM snapshots :-( (BTW, the workload I described is the
worst case for us; I trust it of more general interest in testing the
ability to take backups from snapshots under heavy load?)

Thanks again,
    Murthy