Thread: 8.1.3 and unused files
Am I correct in the thought that the various files listed below are not used by the database and can be safely removed? There were no other active db connections when I issued this command. I think truncate (Slony) left them behind. ssdb=# select filefrom pg_ls_dir('base/'|| (select oid from pg_database where datname = 'ssdb')) as tab(file) where file !~ '\\..*$'and file not in (select relfilenode from pg_class)and file not in ('PG_VERSION','pgsql_tmp'); file ---------143498614349841434985 (3 rows) [root@DB1 16384]# ls -la 143498[456]* -rw------- 1 rbt sysadmin 1073741824 May 1 20:56 1434984 -rw------- 1 rbt sysadmin 1073741824 May 1 21:11 1434984.1 -rw------- 1 rbt sysadmin 1073741824 May 1 23:21 1434984.10 -rw------- 1 rbt sysadmin 1073741824 May 1 23:36 1434984.11 -rw------- 1 rbt sysadmin 1073741824 May 1 23:50 1434984.12 -rw------- 1 rbt sysadmin 1073741824 May 2 00:06 1434984.13 -rw------- 1 rbt sysadmin 1073741824 May 2 00:23 1434984.14 -rw------- 1 rbt sysadmin 1073741824 May 2 00:39 1434984.15 -rw------- 1 rbt sysadmin 1073741824 May 2 00:57 1434984.16 -rw------- 1 rbt sysadmin 1073741824 May 2 01:14 1434984.17 -rw------- 1 rbt sysadmin 1073741824 May 2 01:31 1434984.18 -rw------- 1 rbt sysadmin 1073741824 May 2 01:50 1434984.19 -rw------- 1 rbt sysadmin 1073741824 May 1 21:25 1434984.2 -rw------- 1 rbt sysadmin 1073741824 May 2 02:07 1434984.20 -rw------- 1 rbt sysadmin 1073741824 May 2 02:23 1434984.21 -rw------- 1 rbt sysadmin 1073741824 May 2 02:41 1434984.22 -rw------- 1 rbt sysadmin 1073741824 May 2 02:55 1434984.23 -rw------- 1 rbt sysadmin 1073741824 May 2 03:09 1434984.24 -rw------- 1 rbt sysadmin 1073741824 May 2 03:24 1434984.25 -rw------- 1 rbt sysadmin 1073741824 May 2 03:37 1434984.26 -rw------- 1 rbt sysadmin 1073741824 May 2 03:53 1434984.27 -rw------- 1 rbt sysadmin 1073741824 May 2 04:09 1434984.28 -rw------- 1 rbt sysadmin 1073741824 May 2 04:24 1434984.29 -rw------- 1 rbt sysadmin 1073741824 May 1 21:40 1434984.3 -rw------- 1 rbt sysadmin 1073741824 May 2 04:40 1434984.30 -rw------- 1 rbt sysadmin 1073741824 May 2 04:56 1434984.31 -rw------- 1 rbt sysadmin 990912512 May 2 05:09 1434984.32 -rw------- 1 rbt sysadmin 1073741824 May 1 21:54 1434984.4 -rw------- 1 rbt sysadmin 1073741824 May 1 22:08 1434984.5 -rw------- 1 rbt sysadmin 1073741824 May 1 22:23 1434984.6 -rw------- 1 rbt sysadmin 1073741824 May 1 22:36 1434984.7 -rw------- 1 rbt sysadmin 1073741824 May 1 22:52 1434984.8 -rw------- 1 rbt sysadmin 1073741824 May 1 23:07 1434984.9 -rw------- 1 rbt sysadmin 8192 May 1 20:40 1434985 -rw------- 1 rbt sysadmin 1073741824 May 2 11:27 1434986 -rw------- 1 rbt sysadmin 1073741824 May 2 11:39 1434986.1 -rw------- 1 rbt sysadmin 121733120 May 2 16:53 1434986.10 -rw------- 1 rbt sysadmin 1073741824 May 2 11:56 1434986.2 -rw------- 1 rbt sysadmin 1073741824 May 2 12:15 1434986.3 -rw------- 1 rbt sysadmin 1073741824 May 2 12:43 1434986.4 -rw------- 1 rbt sysadmin 1073741824 May 2 13:15 1434986.5 -rw------- 1 rbt sysadmin 1073741824 May 2 13:53 1434986.6 -rw------- 1 rbt sysadmin 1073741824 May 2 14:35 1434986.7 -rw------- 1 rbt sysadmin 1073741824 May 2 15:38 1434986.8 -rw------- 1 rbt sysadmin 1073741824 May 2 16:53 1434986.9 --
Rod Taylor <pg@rbt.ca> writes: > Am I correct in the thought that the various files listed below are not > used by the database and can be safely removed? There were no other > active db connections when I issued this command. > I think truncate (Slony) left them behind. I don't particularly like the casual assumption that truncate is broken. If I were you I'd be looking harder for a plausible explanation about where these files came from, especially seeing how large they are. Have you tried dumping the file contents to see if the data looks recognizable at all? regards, tom lane
On Fri, 2006-05-05 at 14:09 -0400, Tom Lane wrote: > Rod Taylor <pg@rbt.ca> writes: > > Am I correct in the thought that the various files listed below are not > > used by the database and can be safely removed? There were no other > > active db connections when I issued this command. > > > I think truncate (Slony) left them behind. > > I don't particularly like the casual assumption that truncate is broken. > If I were you I'd be looking harder for a plausible explanation about > where these files came from, especially seeing how large they are. > Have you tried dumping the file contents to see if the data looks > recognizable at all? Hardware is perfectly functional and has been for about 18 months in production with 8.0.x. It is a completely new 8.1 database and Slony is the only entity that has been working in it. There are not very many possibilities. I'm fairly confident I know exactly which table they are/were a part of. 1434984 is the table data, 1434986 is the primary key of the table (only index), and 1434985 is probably the toast structure. The structure have different relfilenode values and valid data at this time. At some point it must have failed in copying the data across, aborted, and restarted. So it would have been something like this: BEGIN; TRUNCATE; "decouple indexes -- ask Jan"; COPY; "recouple indexes"; REINDEX <crash, abort, something else to cause a Slony to restart>; <reconnect> BEGIN; TRUNCATE; "decouple indexes -- ask Jan"; COPY; "recouple indexes"; REINDEX; COMMIT; --
Rod Taylor <pg@rbt.ca> writes: > At some point it must have failed in copying the data across, aborted, > and restarted. Unless you had an actual backend crash, that's not an adequate explanation. Transaction abort does clean up created files. regards, tom lane
On Fri, 2006-05-05 at 14:31 -0400, Tom Lane wrote: > Rod Taylor <pg@rbt.ca> writes: > > At some point it must have failed in copying the data across, aborted, > > and restarted. > > Unless you had an actual backend crash, that's not an adequate > explanation. Transaction abort does clean up created files. The only reason I noticed is because pg_database_size didn't match sum(pg_total_relation_size()) and was investigating what I thought was a bug in one of those functions. I'm afraid we don't have all of the monitoring, logging, and change control bits hooked up to the non-production DBs, so that is pretty much all I have other than conjecture. The only thing I can come up with is that perhaps someone forcefully gave it a kick. SIGTERM is a necessary action once in a while to unwedge a stuck db connection (killing the client script doesn't always get it immediately). Slony holds open a transaction on the master while reindexing the slave, so perhaps someone thought the slave needed help. Making a copy of the master takes several weeks. They may have killed slony, found the statements still working away, SIGTERM'd them both, then restarted slony. It wouldn't be an unusual pattern of events, particularly since they've not been taught about pg_cancel_backend() yet (no 8.1 training). How about this? BEGIN; TRUNCATE; COPY; REINDEX <SIGTERM during REINDEX>; pg_class references old files. New files in their aborted state are left behind? --
Rod Taylor <pg@rbt.ca> writes: > On Fri, 2006-05-05 at 14:31 -0400, Tom Lane wrote: >> Unless you had an actual backend crash, that's not an adequate >> explanation. Transaction abort does clean up created files. > The only thing I can come up with is that perhaps someone forcefully > gave it a kick. SIGTERM is a necessary action once in a while to unwedge > a stuck db connection (killing the client script doesn't always get it > immediately). SIGTERM wouldn't cause that either. I hope your people are not in the habit of using "kill -9"? regards, tom lane
On Fri, 2006-05-05 at 15:10 -0400, Tom Lane wrote: > Rod Taylor <pg@rbt.ca> writes: > > On Fri, 2006-05-05 at 14:31 -0400, Tom Lane wrote: > >> Unless you had an actual backend crash, that's not an adequate > >> explanation. Transaction abort does clean up created files. > > > The only thing I can come up with is that perhaps someone forcefully > > gave it a kick. SIGTERM is a necessary action once in a while to unwedge > > a stuck db connection (killing the client script doesn't always get it > > immediately). > > SIGTERM wouldn't cause that either. I hope your people are not in the > habit of using "kill -9"? Command line records, etc. are not available, but I did track down a a snippet of logs from the backups (daily log rotation). Sorry, I didn't realize there were backups initially -- it's unusual. Appears it did crash or get killed in some way "exited with exit code 1". It's a temp DB to try a different database encoding (prep for 8.1 upgrade) with production data. Is there something you would like me to try doing in an attempt to reproduce? Preferably with a smaller structure. The truncate would have happened as part of the prepareTableForCopy() call. slony%ssdb 10171 4621947 2006-05-02 05:09:40 EDTLOG: 00000: duration: 30526368.316 ms statement: select "_ssrep".prepareTableForCopy(1010);copy <SNIP> from stdin; slony%ssdb 10171 4621947 2006-05-02 05:09:40 EDTLOCATION: exec_simple_query, postgres.c:1103 slony%ssdb 10171 4621947 2006-05-02 05:09:40 EDTLOCATION: exec_simple_query, postgres.c:1103 slony%ssdb 10181 0 2006-05-02 15:32:06 EDTLOG: 08P01: unexpected EOF on client connection slony%ssdb 10181 0 2006-05-02 15:32:06 EDTLOCATION: SocketBackend, postgres.c:295 slony%ssdb 10154 0 2006-05-02 15:32:06 EDTLOG: 08P01: unexpected EOF on client connection slony%ssdb 10154 0 2006-05-02 15:32:06 EDTLOCATION: SocketBackend, postgres.c:295 slony%ssdb 10173 0 2006-05-02 16:30:53 EDTLOG: 08P01: unexpected EOF on client connection slony%ssdb 10173 0 2006-05-02 16:30:53 EDTLOCATION: SocketBackend, postgres.c:295 slony%ssdb 10755 0 2006-05-02 16:30:53 EDTLOG: 08P01: unexpected EOF on client connection slony%ssdb 10755 0 2006-05-02 16:30:53 EDTLOCATION: SocketBackend, postgres.c:295 slony%ssdb 300 0 2006-05-02 16:55:18 EDTLOG: 08P01: unexpected EOF on client connection slony%ssdb 300 0 2006-05-02 16:55:18 EDTLOCATION: SocketBackend, postgres.c:295 slony%ssdb 301 0 2006-05-02 16:55:18 EDTLOG: 08P01: unexpected EOF on client connection slony%ssdb 301 0 2006-05-02 16:55:18 EDTLOCATION: SocketBackend, postgres.c:295 % 1960 2006-05-02 17:03:19 EDTLOG: 00000: server process (PID 10171) exited with exit code 1 % 1960 2006-05-02 17:03:19 EDTLOCATION: LogChildExit, postmaster.c:2416 % 1960 2006-05-02 17:03:19 EDTLOG: 00000: terminating any other active server processes % 1960 2006-05-02 17:03:19 EDTLOCATION: HandleChildCrash, postmaster.c:2306 % 1960 2006-05-02 17:03:19 EDTLOCATION: HandleChildCrash, postmaster.c:2306 % 1960 2006-05-02 17:03:19 EDTLOG: 00000: all server processes terminated; reinitializing % 1960 2006-05-02 17:03:19 EDTLOCATION: reaper, postmaster.c:2206 < snip connection attempts > % 5826 2006-05-02 17:03:22 EDTLOG: 00000: database system was interrupted at 2006-05-02 16:06:20 EDT % 5826 2006-05-02 17:03:22 EDTLOCATION: StartupXLOG, xlog.c:4374 % 5826 2006-05-02 17:03:22 EDTLOG: 00000: checkpoint record is at 59/E0B56920 % 5826 2006-05-02 17:03:22 EDTLOCATION: StartupXLOG, xlog.c:4442 % 5826 2006-05-02 17:03:22 EDTLOG: 00000: redo record is at 59/E0B56920; undo record is at 0/0; shutdown FALSE % 5826 2006-05-02 17:03:22 EDTLOCATION: StartupXLOG, xlog.c:4469 % 5826 2006-05-02 17:03:22 EDTLOG: 00000: next transaction ID: 4863932; next OID: 1441853 % 5826 2006-05-02 17:03:22 EDTLOCATION: StartupXLOG, xlog.c:4472 % 5826 2006-05-02 17:03:22 EDTLOG: 00000: next MultiXactId: 1; next MultiXactOffset: 0 % 5826 2006-05-02 17:03:22 EDTLOCATION: StartupXLOG, xlog.c:4475 % 5826 2006-05-02 17:03:22 EDTLOG: 00000: database system was not properly shut down; automatic recovery in progress % 5826 2006-05-02 17:03:22 EDTLOCATION: StartupXLOG, xlog.c:4532 % 5826 2006-05-02 17:03:22 EDTLOG: 00000: redo starts at 59/E0B56970 % 5826 2006-05-02 17:03:22 EDTLOCATION: StartupXLOG, xlog.c:4569 % 5826 2006-05-02 17:03:22 EDTLOG: 00000: record with zero length at 59/E0E429B8 % 5826 2006-05-02 17:03:22 EDTLOCATION: ReadRecord, xlog.c:2764 % 5826 2006-05-02 17:03:22 EDTLOG: 00000: redo done at 59/E0E42988 % 5826 2006-05-02 17:03:22 EDTLOCATION: StartupXLOG, xlog.c:4627 % 5826 2006-05-02 17:03:23 EDTLOG: 00000: database system is ready % 5826 2006-05-02 17:03:23 EDTLOCATION: StartupXLOG, xlog.c:4821 % 5826 2006-05-02 17:03:23 EDTLOG: 00000: transaction ID wrap limit is 1073749769, limited by database "ssdb" % 5826 2006-05-02 17:03:23 EDTLOCATION: SetTransactionIdLimit, varsup.c:234 --
Rod Taylor <pg@rbt.ca> writes: > % 1960 2006-05-02 17:03:19 EDTLOG: 00000: server process (PID 10171) exited with exit code 1 Hm. I wonder if there are any uses of "exit(1)" in the Slony triggers. regards, tom lane
On Friday 05 May 2006 13:11, Tom Lane wrote: > Rod Taylor <pg@rbt.ca> writes: > > % 1960 2006-05-02 17:03:19 EDTLOG: 00000: server process (PID 10171) > > exited with exit code 1 > > Hm. I wonder if there are any uses of "exit(1)" in the Slony triggers. No, there are no calls to exit() in the database loaded funcs. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings -- Darcy Buskermolen Wavefire Technologies Corp. http://www.wavefire.com ph: 250.717.0200 fx: 250.763.1759
On Fri, 2006-05-05 at 16:11 -0400, Tom Lane wrote: > Rod Taylor <pg@rbt.ca> writes: > > % 1960 2006-05-02 17:03:19 EDTLOG: 00000: server process (PID 10171) exited with exit code 1 > > Hm. I wonder if there are any uses of "exit(1)" in the Slony triggers. It doesn't appear so. It does have this though: Datum _Slony_I_killBackend(PG_FUNCTION_ARGS) {int32 pid;int32 signo;text *signame; if (!superuser()) elog(ERROR, "Slony-I: insufficient privilege for killBackend"); pid = PG_GETARG_INT32(0);signame = PG_GETARG_TEXT_P(1); if (VARSIZE(signame) == VARHDRSZ + 4 && memcmp(VARDATA(signame), "NULL", 0) == 0){ signo = 0;}else if (VARSIZE(signame)== VARHDRSZ + 4 && memcmp(VARDATA(signame), "TERM", 0) == 0){ signo = SIGTERM;}else{ elog(ERROR,"Slony-I: unsupported signal");} if (kill(pid, signo) < 0) PG_RETURN_INT32(-1); PG_RETURN_INT32(0); } --
Rod Taylor <pg@rbt.ca> writes: > On Fri, 2006-05-05 at 16:11 -0400, Tom Lane wrote: >> Hm. I wonder if there are any uses of "exit(1)" in the Slony triggers. > It doesn't appear so. It does have this though: Well, a SIGTERM would have resulted in a bleat in the postmaster log. The striking thing about your log is that the backend went down without saying a word; which would be understandable if it had crashed (eg SEGV or kill -9) but then the postmaster would have seen some other exit status. I'm fairly certain there are no paths in the standard backend code that will exit(1) without any attempt to print a message. That's why I'm wondering about add-ons. regards, tom lane
On Fri, 2006-05-05 at 18:53 -0400, Tom Lane wrote: > Rod Taylor <pg@rbt.ca> writes: > > On Fri, 2006-05-05 at 16:11 -0400, Tom Lane wrote: > >> Hm. I wonder if there are any uses of "exit(1)" in the Slony triggers. > > > It doesn't appear so. It does have this though: > > Well, a SIGTERM would have resulted in a bleat in the postmaster log. > The striking thing about your log is that the backend went down without > saying a word; which would be understandable if it had crashed (eg SEGV > or kill -9) but then the postmaster would have seen some other exit > status. I'm fairly certain there are no paths in the standard backend > code that will exit(1) without any attempt to print a message. That's > why I'm wondering about add-ons. Add-ons are slim. Slony. We don't have any C based functions and only a few plpgsql functions in that DB. I did trim out a ton of autovacuum log entries (it likes to log once a minute) but I don't see anything interesting in that area nor autovac the pid that exited. My knowledge of signal handling is pretty basic. Any chance that multiple SIGTERMs could have caused it to avoid the logging? --