Thread: More along the lines of "WEIRD CRASH?!?!"?
While putting a test server (dual PIII 1GHz, 512M ram, ide single disk for system and $PGDATA) under high load, I keep running into: Mar 18 12:33:19 c257lx postgres[25277]: [6] NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index 'test_pkey' for table 'test' Mar 18 12:33:22 c257lx postgres[25273]: [22] LOG: recycled transaction log file 0000000F00000068 Mar 18 12:33:45 c257lx postgres[25277]: [7] PANIC: open of /home/db/pgsql/data/pg_clog/0406 failed: No such file o r directory Mar 18 12:33:46 c257lx postgres[15667]: [1] LOG: server process (pid 25277) was terminated by signal 6 This always happens several passes into a loop (second or third with default shared memory settings, much later with with 4096 shared buffers and 32 checkpoint segments). The workload in question goes like this: for each file in a zip archive: unzip $file | copy to test from stdin; index 'dry_quad_hold'; insert into table1 from select ... where ... (inner joins to "denormalize" the data) <3 of these in parallel subshells> ; add indexes to the "sub-tables" in 3 parallel subshells; loop over roughly 100 text files of 340,000 rows and 80 columns each. (It is designed to test my "LVM snapshots" script.) The workload being tested is handled without problem on production servers running 7.2.3 -- quad (dual) PIII Xeon 500 MHz w, 6 GB (2 GB) RAM, SCSI (IDE; "3ware") RAID5 decicated partition for $PGDATA. Searching for <str>postgresql "terminated by signal 6"</str> led me to this thread: http://archives.postgresql.org/pgsql-performance/2003-01/msg00322.php Is this a 7.3 regression? ... a memory size thing? I'm installing 7.2.4 on the test machine and will re-run the same script there. Thanks for help/suggestions, Murthy
Murthy Kambhampaty <murthy.kambhampaty@goeci.com> writes: > Mar 18 12:33:45 c257lx postgres[25277]: [7] PANIC: open of > /home/db/pgsql/data/pg_clog/0406 failed: No such file o > r directory > This always happens several passes into a loop (second or third with default > shared memory settings, much later with with 4096 shared buffers and 32 > checkpoint segments). Is it always complaining of the same pg_clog file name? Are you getting rid of the corruption between tries (eg with initdb), or is it just that you keep hitting the same pre-existing corrupted data? regards, tom lane
Tom Lane [mailto:tgl@sss.pgh.pa.us] writes: Murthy Kambhampaty <murthy.kambhampaty@goeci.com> writes: >> Mar 18 12:33:45 c257lx postgres[25277]: [7] PANIC: open of >> /home/db/pgsql/data/pg_clog/0406 failed: No such file o >> r directory > >> This always happens several passes into a loop (second or third with default >> shared memory settings, much later with 4096 shared buffers and 32 >> checkpoint segments). > >Is it always complaining of the same pg_clog file name? Are you getting >rid of the corruption between tries (eg with initdb), or is it just that >you keep hitting the same pre-existing corrupted data? I am not running initdb between tries to get rid of the corruption (I am dropping the tables that are created). However, the filename is not always the same (the repeated ones are almost certainly cases where I forgot to drop the tables between iterations): # less /var/log/postgresql | grep clog Mar 17 13:42:11 c257lx postgres[30988]: [7] PANIC: open of /home/db/pgsql/data/pg_clog/0001 failed: No such file or directory Mar 17 13:47:02 c257lx postgres[32111]: [10] PANIC: open of /home/db/pgsql/data/pg_clog/0001 failed: No such file or directory Mar 17 20:33:58 c257lx postgres[6483]: [7] PANIC: open of /home/db/pgsql/data/pg_clog/0B02 failed: No such file or directory Mar 17 23:52:12 c257lx postgres[8161]: [10] PANIC: open of /home/db/pgsql/data/pg_clog/0B02 failed: No such file or directory Mar 18 04:03:06 c257lx postgres[8628]: [13] PANIC: read of clog file 0, offset 147456 failed: Success Mar 18 09:25:28 c257lx postgres[10341]: [15] PANIC: read of clog file 0, offset 196608 failed: Success Mar 18 12:33:45 c257lx postgres[25277]: [7] PANIC: open of /home/db/pgsql/data/pg_clog/0406 failed: No such file or directory Mar 18 13:12:56 c257lx postgres[25398]: [69] PANIC: open of /home/db/pgsql/data/pg_clog/0730 failed: No such file or directory I can test with an initdb between iterations, if that would help. (I am in the process of setting up 7.2.4.) Thanks, Murthy
Murthy Kambhampaty <murthy.kambhampaty@goeci.com> writes: > I am not running initdb between tries to get rid of the corruption (I am > dropping the tables that are created). However, the filename is not always > the same (the repeated ones are almost certainly cases where I forgot to > drop the tables between iterations): Hm. This looks like ongoing data corruption to me --- the 'pg_clog' failure is often the first visible symptom of a trashed tuple header (or completely trashed page --- which is a case that we should detect earlier, but don't at the moment). Have you run memory and disk diagnostics recently on this machine? I think it's dropping data every so often. regards, tom lane
Tom Lane [mailto:tgl@sss.pgh.pa.us] writes: ... > >Have you run memory and disk diagnostics recently on this machine? >I think it's dropping data every so often. > Just the bios memory test at startup. I'll see what I can find on the disk diagnostics (its a recent IDE disk, so I guess anything's possible there). So much for testing LVM snapshots :-( (BTW, the workload I described is the worst case for us; I trust it of more general interest in testing the ability to take backups from snapshots under heavy load?) Thanks again, Murthy