Thread: Postgresql 8.2.4 on linux-sparc problem

Postgresql 8.2.4 on linux-sparc problem

From
Zelinskiy Alexander
Date:
Hello, all.

I have PostgreSQL 8.2.4 running on Gentoo Linux on a Sun Fire V240
server with a 3310 storage array. And sometimes I have a problem: from
time to time one postgres process starts using all CPU time. I can't
kill this process. I can't stop postgres. I can't shut down the server
from the OS at all; only a reset helps. When it happens (CPU usage at
100%), the DB keeps working for about 30 minutes (new connections can
be established, queries work fine). But after 30 minutes the DB stops
responding.

I understand that this description doesn't give you enough information
to advise me what to do. But I don't know what to collect from the OS/
PG when it happens again. Could you advise me what I should do the
next time one of the processes uses all the CPU time?

I guess I can collect a stack trace from this process at that time
using GDB or something like it. But I have no experience with how
exactly to do that.

About the DB: it is a statistical DB of about 40GB. Every minute some
processes insert data collected from different sources. The biggest
part of the statements coming from clients is INSERTs. I also have a
web interface to this DB which shows a lot of graphs. I have some
aggregation functions in PL/pgSQL which SELECT info and then INSERT or
UPDATE it in some tables. I never use DELETE (except in a few cases,
but I'm sure the problem does not appear during a DELETE).

Every day the scripts insert data from the same sources via the same
queries. But the problem appears at different times.

More info:

postgres=# show server_version;
server_version
----------------
8.2.4

machupicchu ~ # uname -a
Linux machupicchu 2.6.20-gentoo-r8 #1 SMP Fri Jun 22 09:51:20 MSD 2007 sparc64 sun4u TI UltraSparc IIIi (Jalapeno) GNU/Linux

machupicchu pgdata # grep -v "^#" postgresql.conf | sort -u
autovacuum = on
autovacuum_analyze_scale_factor = 0.1
autovacuum_analyze_threshold = 250
autovacuum_naptime = 10min
autovacuum_vacuum_scale_factor = 0.2
autovacuum_vacuum_threshold = 500
client_min_messages = notice
constraint_exclusion = on
datestyle = 'iso, mdy'
enable_bitmapscan = on
enable_hashagg = on
enable_hashjoin = on
enable_indexscan = on
enable_mergejoin = on
enable_nestloop = on
enable_seqscan = on
enable_sort = on
enable_tidscan = on
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
listen_addresses = '*'
log_directory = 'pg_log'
log_duration = on
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_line_prefix = '%m, %s, %r, %p, '
log_min_duration_statement = 1000
log_min_error_statement = error
log_min_messages = error
log_statement = 'all'
log_truncate_on_rotation = off
maintenance_work_mem = 1572864
max_connections = 100
max_fsm_pages = 153600
redirect_stderr = on
shared_buffers = 24MB
stats_block_level = on
stats_command_string = on
stats_row_level = on
stats_start_collector = on
work_mem = 1048576


Re: Postgresql 8.2.4 on linux-sparc problem

From
Tom Lane
Date:
Zelinskiy Alexander <alexander.zelinskiy@gmail.com> writes:
> I understand that this description doesn't give you enough information
> to advise me what to do. But I don't know what to collect from the OS/
> PG when it happens again. Could you advise me what I should do the
> next time one of the processes uses all the CPU time?

Before you get into that, try reducing these parameters:

> maintenance_work_mem = 1572864
> work_mem = 1048576

They are way too high, especially the second one.  But if that doesn't
help:

> I guess I can collect a stack trace from this process at that time
> using GDB or something like it. But I have no experience with how
> exactly to do that.

Yup, that's the thing to do when you don't know what's happening.
The procedure is:

Determine PID of the troublesome backend (use ps or whatever).
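
For instance, with GNU ps, sorting by CPU usage makes the hog obvious
(just one way to do it; top works too):

    $ ps -C postgres -o pid,pcpu,stat,cmd --sort=-pcpu | head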

As root or the postgres user, do

    $ gdb /path/to/postgres-executable PID
    ... introductory noise, but the list of libraries can be useful ...
    gdb> bt
    ... useful info here ...
    gdb> quit
    OK to detach? y
    $

Necessary advance preparation: be sure gdb is installed and you have a
debug-enabled postgres (install debuginfo RPM if needed).  I'd suggest
practicing and making sure that the backtrace shows routine names and
parameter values.  As an example, a backtrace I was just looking at
looked in part like this:

#0  0x000000310202e174 in xmlCleanupCharEncodingHandlers ()
   from /usr/lib64/libxml2.so.2
#1  0x0000003102036725 in xmlCleanupParser () from /usr/lib64/libxml2.so.2
#2  0x0000000000660e39 in xpath (fcinfo=<value optimized out>) at xml.c:3452
#3  0x000000000053601e in ExecMakeFunctionResult (fcache=0xb32f60,
    econtext=0xb32e68, isNull=0xb33ed8 "", isDone=0xb33f90) at execQual.c:1351
#4  0x00000000005343c6 in ExecProject (projInfo=<value optimized out>,
    isDone=0x0) at execQual.c:4601
#5  0x000000000053f5a5 in ExecAgg (node=0xb32b58) at nodeAgg.c:989
#6  0x0000000000533eaa in ExecProcNode (node=0xb32b58) at execProcnode.c:394

I do not have debuginfo installed for libxml2, thus the first two lines
provide very little information.  The rest look more like what a
developer would like to see.
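
A quick sanity check in advance that your binary still has its symbols
(assuming the file utility is available):

    $ file /usr/bin/postgres
    ... an unstripped binary reports "not stripped" in the output ...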

            regards, tom lane

Re: Postgresql 8.2.4 on linux-sparc problem

From
Lew
Date:
Tom Lane wrote:
> Before you get into that, try reducing these parameters:
>
>> maintenance_work_mem = 1572864
>> work_mem = 1048576
>
> They are way too high, especially the second one.

So if 1.5 MB maintenance_work_mem and 1 MB work_mem are "way too high",
why are the default values in postgresql.conf on installation 16 MB and
1 MB, respectively?

--
Lew

Re: Postgresql 8.2.4 on linux-sparc problem

From
Lew
Date:
Lew wrote:
> Tom Lane wrote:
>> Before you get into that, try reducing these parameters:
>>
>>> maintenance_work_mem = 1572864
>>> work_mem = 1048576
>>
>> They are way too high, especially the second one.
>
> So if 1.5 MB maintenance_work_mem and 1 MB work_mem are "way too high",
> why are the default values in postgresql.conf on installation 16 MB and
> 1 MB, respectively?

Replying to my own question - I must learn to check the docs *before* asking:
<http://www.postgresql.org/docs/8.2/interactive/config-setting.html>
> Some settings specify a memory or time value. Each of these has an
> implicit unit, which is either kilobytes, blocks (typically eight
> kilobytes), milliseconds, seconds, or minutes.

I was off by a factor of 1024 in how I read that.  The OP had specified
1.5 *GB* and 1 *GB*, respectively!
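
FWIW, 8.2 also accepts explicit units for these settings (the OP's own
config already uses one for shared_buffers = 24MB), which would have
made the mistake obvious.  Written out, the bare numbers mean:

    work_mem = 1GB                  # same as the bare 1048576 (kB)
    maintenance_work_mem = 1536MB   # same as the bare 1572864 (kB)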

--
Lew

Re: Postgresql 8.2.4 on linux-sparc problem

From
Zelinskiy Alexander
Date:
Hello, Tom.

Once again it happened.

postgres 11009     1  0 Jan12 ?        00:00:12 /usr/bin/postmaster -D /data/idx/pgdata --silent-mode=true
postgres 11027 11009  0 Jan12 ?        00:26:55 postgres: logger process
postgres 11029 11009  0 Jan12 ?        00:00:21 postgres: writer process
postgres 11030 11009  0 Jan12 ?        00:16:50 postgres: stats collector process
postgres 16751 11009  1 20:40 ?        00:00:12 postgres: stat stat 10.0.0.2(41239) idle
postgres 16753 11009  0 20:40 ?        00:00:11 postgres: stat stat 10.0.0.2(41244) idle
postgres 16758 11009  3 20:41 ?        00:00:35 postgres: stat stat 10.0.0.2(50546) SELECT
postgres 16760 11009  0 20:42 ?        00:00:00 postgres: stat stat 10.0.0.2(50573) idle
postgres 16761 11009 99 20:42 ?        00:16:59 postgres: stat stat 10.0.0.2(50577) idle
postgres 16762 11009  0 20:43 ?        00:00:00 postgres: stat stat 10.0.0.2(50603) INSERT

I tried to use gdb, but with no success.

machupicchu ~ # gdb /usr/bin/postgres 16761
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "sparc-unknown-linux-gnu"...
Using host libthread_db library "/lib/libthread_db.so.1".
Attaching to program: /usr/bin/postgres, process 16761
....
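
If it happens again, maybe a non-interactive backtrace would get
further; gdb 6.4 and later support -ex, so something like:

    # gdb -batch -ex bt /usr/bin/postgres 16761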

I enabled full logging to find the query after which PG dies.  From
postgresql.conf:
log_directory = 'pg_log'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_truncate_on_rotation = off
log_min_messages = error
log_min_error_statement = error
log_min_duration_statement = 1000
log_duration = on
log_line_prefix = '%m, %s, %r, %p, '
log_statement = 'all'

But I can't find anything about PID 16761 or TCP port 50577 in the
logs! I can find its neighbours, but not this process.
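
With this log_line_prefix the backend PID is the fourth comma-separated
field, so a search along these lines should catch any entry for it (the
data directory is /data/idx/pgdata, per the ps output above):

    # grep ', 16761, ' /data/idx/pgdata/pg_log/postgresql-*.log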

Also:
# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 4 24   1408  25832 180800 3738664    0    0     5    23   31    5  8  1 84  6
 4 24   1408  25832 180800 3738664    0    0     0     0 1010   14  0 50  0 50
 4 24   1408  25832 180800 3738664    0    0     0     0 1005   11  0 50  0 50
 4 24   1408  25832 180800 3738664    0    0     0     0 1005   11  0 50  0 50
 4 24   1408  25832 180800 3738664    0    0     0     0 1004   11  0 50  0 50
 4 24   1408  25832 180800 3738664    0    0     0     0 1005   13  0 50  0 50
 4 24   1408  25832 180800 3738664    0    0     0     0 1004   13  0 50  0 50
 4 24   1408  25832 180800 3738664    0    0     0     0 1004   11  0 50  0 50

Looks like the problem is in I/O wait?
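
The 24 processes in the b column point the same way. Something like
this should show which processes sit in uninterruptible sleep, and in
which kernel function (GNU ps; wchan can print '-' on some kernels):

    # ps -eo pid,stat,wchan:30,cmd | awk 'NR==1 || $2 ~ /D/'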

What to do? Where to dig?

In a few weeks I want to migrate to two redundant Sun Fire V440
servers with four storage arrays. The main idea is to have HW
redundancy. But right now... I don't know what to say to my boss. The
HW is fine, but there's a lot of data loss... :)

P.S. I commented out my settings for the *_mem options. They are at the
defaults now.

Re: Postgresql 8.2.4 on linux-sparc problem

From
Tom Lane
Date:
Zelinskiy Alexander <alexander.zelinskiy@gmail.com> writes:
> Once again it happened.

Looking back at your original post ... maybe you should try a non-Gentoo
kernel.  Gentoo doesn't have the greatest reputation for stability, and
on a non-mainstream architecture like sparc64 it'd likely be even buggier
than its usual reputation.  The symptoms you cite are consistent with a
disk operation getting stuck in the kernel, and a system that locks up
to the point of requiring a hard reset sounds even more like a kernel
problem.
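
If a disk operation really is stuck in the kernel, a task-state dump
may show where, assuming the magic SysRq interface is enabled and the
box is still responsive enough to run it:

    # echo 1 > /proc/sys/kernel/sysrq
    # echo t > /proc/sysrq-trigger    # dumps every task's kernel stack to the kernel log
    # dmesg | less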

            regards, tom lane