Thread: PostgreSQL 8.2.4 on linux-sparc problem
Hello, all.

I have PostgreSQL 8.2.4 running on Gentoo Linux on a Sun Fire V240 server with a 3310 storage array. And sometimes I have a problem: from time to time one postgres process starts using all the CPU time. I can't kill this process, I can't stop postgres, and I can't even shut down the server from the OS; only a reset helps. When it happens (CPU usage at 100%), the DB keeps working for about 30 minutes (new connections can be established, queries work fine), but after 30 minutes the DB stops responding.

I understand that this description doesn't give you enough information to advise what to do. But I don't know what to collect from the OS/PG when it happens again. Could you give me advice on what I should do the next time one of the processes uses all the CPU time? I guess I can collect a stack trace from that process at the time using GDB or something like it, but I have no experience with exactly how to do it.

About the DB: it is a statistical DB of about 40 GB. Every minute, several processes insert data collected from different sources; most of the statements coming from clients are INSERTs. I also have a web interface to this DB which shows a lot of graphs, and some aggregation functions in PL/pgSQL which SELECT info and then INSERT or UPDATE it into some tables. I never use DELETE (except in a few cases, but I'm sure the problem does not appear during a DELETE). Every day the scripts insert data from the same sources via the same queries, yet the problem appears at different times.

More info:

postgres=# show server_version;
 server_version
----------------
 8.2.4

machupicchu ~ # uname -a
Linux machupicchu 2.6.20-gentoo-r8 #1 SMP Fri Jun 22 09:51:20 MSD 2007 sparc64 sun4u TI UltraSparc IIIi (Jalapeno) GNU/Linux

machupicchu pgdata # grep -v "^#" postgresql.conf | sort -u
autovacuum = on
autovacuum_analyze_scale_factor = 0.1
autovacuum_analyze_threshold = 250
autovacuum_naptime = 10min
autovacuum_vacuum_scale_factor = 0.2
autovacuum_vacuum_threshold = 500
client_min_messages = notice
constraint_exclusion = on
datestyle = 'iso, mdy'
enable_bitmapscan = on
enable_hashagg = on
enable_hashjoin = on
enable_indexscan = on
enable_mergejoin = on
enable_nestloop = on
enable_seqscan = on
enable_sort = on
enable_tidscan = on
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
listen_addresses = '*'
log_directory = 'pg_log'
log_duration = on
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_line_prefix = '%m, %s, %r, %p, '
log_min_duration_statement = 1000
log_min_error_statement = error
log_min_messages = error
log_statement = 'all'
log_truncate_on_rotation = off
maintenance_work_mem = 1572864
max_connections = 100
max_fsm_pages = 153600
redirect_stderr = on
shared_buffers = 24MB
stats_block_level = on
stats_command_string = on
stats_row_level = on
stats_start_collector = on
work_mem = 1048576
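[A minimal sketch of the sort of snapshot worth grabbing the moment a backend starts spinning, before anything else is tried. Assumptions: the stock 8.2 statistics views, and a user/database named "stat" as in the ps output later in the thread; substitute your own names.]

$ ps -o pid,pcpu,stat,wchan:25,cmd -C postgres          # which backend is hot, and whether it sleeps in the kernel
$ psql -U stat -d stat -c "SELECT procpid, query_start, current_query FROM pg_stat_activity ORDER BY query_start"
$ vmstat 1 10                                           # user/system/iowait split over ten seconds

procpid and current_query are the 8.2-era column names; current_query is only populated because stats_command_string = on (it is, in the config above).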
Zelinskiy Alexander <alexander.zelinskiy@gmail.com> writes:
> I understand that this description doesn't give you enough information
> to advise what to do. But I don't know what to collect from the OS/PG
> when it happens again. Could you give me advice on what I should do the
> next time one of the processes uses all the CPU time?

Before you get into that, try reducing these parameters:

> maintenance_work_mem = 1572864
> work_mem = 1048576

They are way too high, especially the second one. But if that doesn't help:

> I guess I can collect a stack trace from that process at the time using
> GDB or something like it, but I have no experience with exactly how to
> do it.

Yup, that's the thing to do when you don't know what's happening. The procedure is: determine the PID of the troublesome backend (use ps or whatever), then, as root or the postgres user, do

$ gdb /path/to/postgres-executable PID
... introductory noise, but the list of libraries can be useful ...
gdb> bt
... useful info here ...
gdb> quit
OK to detach? y
$

Necessary advance preparation: be sure gdb is installed and you have a debug-enabled postgres (install the debuginfo RPM if needed). I'd suggest practicing and making sure that the backtrace shows routine names and parameter values. As an example, a backtrace I was just looking at looked in part like this:

#0  0x000000310202e174 in xmlCleanupCharEncodingHandlers () from /usr/lib64/libxml2.so.2
#1  0x0000003102036725 in xmlCleanupParser () from /usr/lib64/libxml2.so.2
#2  0x0000000000660e39 in xpath (fcinfo=<value optimized out>) at xml.c:3452
#3  0x000000000053601e in ExecMakeFunctionResult (fcache=0xb32f60, econtext=0xb32e68, isNull=0xb33ed8 "", isDone=0xb33f90) at execQual.c:1351
#4  0x00000000005343c6 in ExecProject (projInfo=<value optimized out>, isDone=0x0) at execQual.c:4601
#5  0x000000000053f5a5 in ExecAgg (node=0xb32b58) at nodeAgg.c:989
#6  0x0000000000533eaa in ExecProcNode (node=0xb32b58) at execProcnode.c:394

I do not have debuginfo installed for libxml2, thus the first two lines provide very little information. The rest look more like what a developer would like to see.

			regards, tom lane
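[If attaching interactively is awkward while the machine is struggling, the same backtrace can be grabbed non-interactively. A sketch, assuming a gdb new enough to understand -ex (6.4 or later); the PID below is a placeholder.]

$ PID=12345                                             # substitute the PID of the spinning backend
$ gdb --batch -ex bt /usr/bin/postgres $PID > /tmp/bt.$PID.txt 2>&1

Taking two or three samples a few seconds apart also shows whether the backend is stuck at one spot or looping through the same code.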
Tom Lane wrote:
> Before you get into that, try reducing these parameters:
>
>> maintenance_work_mem = 1572864
>> work_mem = 1048576
>
> They are way too high, especially the second one.

So if 1.5 MB maintenance_work_mem and 1 MB work_mem are "way too high", why are the default values in the postgresql.conf on installation 16 MB and 1 MB, respectively?

--
Lew
Lew wrote:
> Tom Lane wrote:
>> Before you get into that, try reducing these parameters:
>>
>>> maintenance_work_mem = 1572864
>>> work_mem = 1048576
>>
>> They are way too high, especially the second one.
>
> So if 1.5 MB maintenance_work_mem and 1 MB work_mem are "way too high",
> why are the default values in the postgresql.conf on installation 16 MB
> and 1 MB, respectively?

Replying to my own question - I must learn to check the docs *before* asking:

<http://www.postgresql.org/docs/8.2/interactive/config-setting.html>
> Some settings specify a memory or time value. Each of these has an
> implicit unit, which is either kilobytes, blocks (typically eight
> kilobytes), milliseconds, seconds, or minutes.

I was off by a factor of 1024 in how I read that. The OP had specified 1.5 *GB* and 1 *GB*, respectively!

--
Lew
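[For the record, with kilobytes as the implicit unit, the values the OP set work out to:]

work_mem             = 1048576 kB = 1024 MB = 1 GB
maintenance_work_mem = 1572864 kB = 1536 MB = 1.5 GB

[A quick way to double-check how the server itself interprets a setting; the database name is the one used elsewhere in the thread, and depending on the version the value may be echoed back with a unit suffix or in plain kB:]

$ psql -d stat -c "SHOW work_mem"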
Hello, Tom. Once again it happened.

postgres 11009     1  0 Jan12 ?  00:00:12 /usr/bin/postmaster -D /data/idx/pgdata --silent-mode=true
postgres 11027 11009  0 Jan12 ?  00:26:55 postgres: logger process
postgres 11029 11009  0 Jan12 ?  00:00:21 postgres: writer process
postgres 11030 11009  0 Jan12 ?  00:16:50 postgres: stats collector process
postgres 16751 11009  1 20:40 ?  00:00:12 postgres: stat stat 10.0.0.2(41239) idle
postgres 16753 11009  0 20:40 ?  00:00:11 postgres: stat stat 10.0.0.2(41244) idle
postgres 16758 11009  3 20:41 ?  00:00:35 postgres: stat stat 10.0.0.2(50546) SELECT
postgres 16760 11009  0 20:42 ?  00:00:00 postgres: stat stat 10.0.0.2(50573) idle
postgres 16761 11009 99 20:42 ?  00:16:59 postgres: stat stat 10.0.0.2(50577) idle
postgres 16762 11009  0 20:43 ?  00:00:00 postgres: stat stat 10.0.0.2(50603) INSERT

I tried to use gdb, but without success:

machupicchu ~ # gdb /usr/bin/postgres 16761
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "sparc-unknown-linux-gnu"...
Using host libthread_db library "/lib/libthread_db.so.1".
Attaching to program: /usr/bin/postgres, process 16761
....

I enabled full logging to find the query after which PG dies. From postgresql.conf:

log_directory = 'pg_log'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_truncate_on_rotation = off
log_min_messages = error
log_min_error_statement = error
log_min_duration_statement = 1000
log_duration = on
log_line_prefix = '%m, %s, %r, %p, '
log_statement = 'all'

But I can't find anything about PID 16761 or TCP port 50577 in the logs! I can find its neighbours, but not this process.

Also:

# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 4 24   1408  25832 180800 3738664    0    0     5    23   31    5  8  1 84  6
 4 24   1408  25832 180800 3738664    0    0     0     0 1010   14  0 50  0 50
 4 24   1408  25832 180800 3738664    0    0     0     0 1005   11  0 50  0 50
 4 24   1408  25832 180800 3738664    0    0     0     0 1005   11  0 50  0 50
 4 24   1408  25832 180800 3738664    0    0     0     0 1004   11  0 50  0 50
 4 24   1408  25832 180800 3738664    0    0     0     0 1005   13  0 50  0 50
 4 24   1408  25832 180800 3738664    0    0     0     0 1004   13  0 50  0 50
 4 24   1408  25832 180800 3738664    0    0     0     0 1004   11  0 50  0 50

It looks like the problem is in I/O wait? What should I do? Where should I dig?

In a few weeks I want to migrate to 2 redundant Sun Fire V440 servers with 4 storage arrays. The main idea is to have HW redundancy. But now.. I don't know what to say to my boss. The HW is fine, but a lot of data loss.. :)

P.S. I commented out my settings for the _mem options; they are back to the defaults now.
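[Given that gdb hung while attaching and that vmstat shows the CPUs sitting at 50% system / 50% iowait, it may be worth capturing kernel-side state next time as well. A sketch, run as root; the PID 16761 from the ps output above is used as a placeholder, and the SysRq step assumes magic SysRq is enabled (kernel.sysrq = 1).]

# grep State /proc/16761/status        # D = uninterruptible sleep (stuck in the kernel), R = runnable
# cat /proc/16761/wchan; echo          # kernel function the task is blocked in, if any
# echo t > /proc/sysrq-trigger         # dump kernel stacks of all tasks into the kernel log
# dmesg | less                         # find the stuck postgres task and any SCSI/I/O errors

If the backend shows up in D state there, that would explain both the unkillable process and the hung gdb attach.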
Zelinskiy Alexander <alexander.zelinskiy@gmail.com> writes:
> Once again it happened.

Looking back at your original post ... maybe you should try a non-Gentoo kernel. Gentoo doesn't have the greatest reputation for stability, and on a non-mainstream architecture like sparc64 it'd likely be even buggier than its usual reputation. The symptoms you cite are consistent with a disk operation getting stuck in the kernel, and a system that locks up to the point of requiring a hard reset sounds even more like a kernel problem.

			regards, tom lane
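[Not a substitute for trying a different kernel, but if a stuck disk operation is the suspicion, the kernel log from just before a lockup is worth keeping. A sketch; the device name is a placeholder, and smartctl only applies if smartmontools is installed and the 3310 exposes its disks as plain SCSI devices.]

# dmesg | egrep -i 'scsi|i/o error|timeout|reset'
# smartctl -a /dev/sda                 # repeat for each disk behind the array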