Thread: rebellious postgres process

rebellious postgres process

From: Laszlo Nagy
This is what I have been seeing in "top" for the last two days:

last pid: 95702;  load averages:  2.48,  2.69,  2.75    up 3+02:08:49  10:27:58
257 processes: 3 running, 246 sleeping, 8 zombie
CPU states: 15.5% user,  0.0% nice, 22.1% system,  1.3% interrupt, 61.1% idle
Mem: 864M Active, 6028M Inact, 512M Wired, 254M Cache, 214M Buf, 211M Free
Swap: 10G Total, 500K Used, 10G Free

 PID USERNAME       THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
 677 pgsql            1 107    0 22396K  5460K CPU3   6  52.4H 99.02% postgres

The process has been using 100% CPU for 52 hours! Here it is:

%ps axl | grep 677
  70   677   666 757 107  0 22396  5460 select Rs    ??  3144:50.88
postgres: stats collector process    (postgres)


Is this normal? Fortunately, we have multiple processors in this system, so it
is not causing a slowdown. But I really need to know what is happening here;
I don't believe this is how it should work.

Yesterday the system restarted unexpectedly, and this process might be the
reason. After the restart, the process is doing the same thing again:

  PID USERNAME       THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
  904 pgsql            1 105    0 22396K  5332K CPU3   3 386:05 99.22% postgres


Here is some more info:

Hardware: two quad-core Xeon 2.5GHz processors, 8GB RAM, RAID controller with
2GB cache and BBU, 10 SATA2 disks in RAID 1+0
OS: FreeBSD 7.0 amd64
PostgreSQL: 8.3
Databases: two bigger ones; total database size on disk is about 10GB, used
mostly read-only, but some smaller tables are updated frequently.

I don't think it should need to collect stats continuously like this.
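
For reference, a quick way to see which stats-collection options are enabled
in 8.3 (this assumes psql access as the pgsql superuser; connecting to the
"postgres" database is just a convenient default):

% psql -U pgsql -d postgres -c "SELECT name, setting FROM pg_settings WHERE name LIKE 'track%'"

In 8.3 that lists track_activities and track_counts; note that autovacuum
needs track_counts, so turning it off is usually not a good trade.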

Thanks,

  Laszlo


Re: rebellious postgres process

From: Tom Lane
Laszlo Nagy <gandalf@shopzeus.com> writes:
> The process has been using 100% CPU for 52 hours! Here it is:

> %ps axl | grep 677
>   70   677   666 757 107  0 22396  5460 select Rs    ??  3144:50.88
> postgres: stats collector process    (postgres)

Huh, that's weird.  We've fixed some bugs in the past that led the stats
collector to consume excessive CPU --- but that was all pre-8.3.

Do you have a whole lot of tables in this database?  (Or even more
directly: how large is $PGDATA/global/pgstat.stat ?)
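
(For example, a rough check could look like this; the database name is a
placeholder and $PGDATA stands for the data directory:

% psql -d yourdb -c "SELECT count(*) FROM pg_class WHERE relkind = 'r'"
% ls -l $PGDATA/global/pgstat.stat

The first command counts ordinary tables, system catalogs included.)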

            regards, tom lane

Re: rebellious postgres process

From: Laszlo Nagy
Tom Lane wrote:
> Laszlo Nagy <gandalf@shopzeus.com> writes:
>
>> The process has been using 100% CPU for 52 hours! Here it is:
>>
>
>
>> %ps axl | grep 677
>>   70   677   666 757 107  0 22396  5460 select Rs    ??  3144:50.88
>> postgres: stats collector process    (postgres)
>>
>
> Huh, that's weird.  We've fixed some bugs in the past that led the stats
> collector to consume excessive CPU --- but that was all pre-8.3.
>
My version is 8.3.3, compiled from FreeBSD ports. I just updated my ports
tree and that is the most up-to-date version officially supported there.

The process is still running:


  PID USERNAME       THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
  904 pgsql            1 105    0 22396K  5428K CPU1   1 856:13 99.02% postgres

> Do you have a whole lot of tables in this database?
One database has 140 tables, the other has 62. (There are two other
database instances but they are rarely used.)
> (Or even more directly: how large is $PGDATA/global/pgstat.stat ?)
>
-rw-------  1 pgsql  pgsql  54536 Nov  3 20:11 pgstat.stat

Thanks,

    Laszlo



Re: rebellious postgres process

From: Laszlo Nagy
Tom Lane wrote:
> Huh, that's weird.  We've fixed some bugs in the past that led the stats
> collector to consume excessive CPU --- but that was all pre-8.3.
>

The server was rebooting intermittently, so we replaced the RAM (we got a
kernel page fault). But that was a week ago, and the server is now stable.
Is it possible that the file system somehow became inconsistent, and that is
causing an infinite loop in the stats collector? Just guessing.


Re: rebellious postgres process

From: Tom Lane
Laszlo Nagy <gandalf@shopzeus.com> writes:
> The process is still running:

Could you attach to it with gdb and see what it's doing?
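
A minimal way to do that, assuming gdb is installed and the collector's PID
is still 904, would be:

% gdb -p 904
(gdb) bt
(gdb) detach
(gdb) quit

Attaching pauses the process briefly and it resumes on detach; "bt" prints
the current stack trace. Repeating the attach and "bt" a few times shows
whether it is stuck in one place or looping through the same code.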

>> Do you have a whole lot of tables in this database?
> One database has 140 tables, the other has 62. (There are two other
> database instances but they are rarely used.)
>> (Or even more directly: how large is $PGDATA/global/pgstat.stat ?)
>>
> -rw-------  1 pgsql  pgsql  54536 Nov  3 20:11 pgstat.stat

Well, that lets out the theory that it just has a whole lot of stats
to keep track of ... although it's quite interesting that the file
timestamp isn't current.  Somehow it's evidently gotten wedged in
a way that prevents it from updating the file.
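
(A quick way to see how stale the file is, using FreeBSD's stat(1), with
$PGDATA standing for the data directory:

% date
% stat -f "%Sm %N" $PGDATA/global/pgstat.stat

Under normal operation the collector rewrites this file frequently, so its
mtime should track the current time closely.)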

            regards, tom lane

Re: rebellious postgres process

From: "Scott Marlowe"
On Tue, Nov 4, 2008 at 8:48 AM, Laszlo Nagy <gandalf@shopzeus.com> wrote:
> Tom Lane wrote:
>>
>> Huh, that's weird.  We've fixed some bugs in the past that led the stats
>> collector to consume excessive CPU --- but that was all pre-8.3.
>>
>
> The server was rebooting intermittently, so we replaced the RAM (we got a
> kernel page fault). But it was a week ago. The server is now stable. But is
> it possible that somehow the file system became inconsistent, and that is
> causing an infinite loop in the stats collector? Just guessing.

Yes, you really can't trust any data that was written to the drives
while the bad memory was in place.
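
If the on-disk stats file itself is suspect, one conservative option (just a
sketch; the paths are the FreeBSD port's defaults) would be to stop the
server, remove the file, and let the collector rebuild it from scratch. Be
aware this resets all accumulated statistics counters:

% /usr/local/etc/rc.d/postgresql stop
% rm /usr/local/pgsql/data/global/pgstat.stat
% /usr/local/etc/rc.d/postgresql start

(Use "onestop"/"onestart" if the service is not enabled in rc.conf.)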

Re: rebellious postgres process

From: Tom Lane
"Scott Marlowe" <scott.marlowe@gmail.com> writes:
> On Tue, Nov 4, 2008 at 8:48 AM, Laszlo Nagy <gandalf@shopzeus.com> wrote:
>> The server was rebooting intermittently, so we replaced the RAM (we got a
>> kernel page fault). But it was a week ago. The server is now stable. But is
>> it possible that somehow the file system became inconsistent, and that is
>> causing an infinite loop in the stats collector? Just guessing.

> Yes, you really can't trust any data that was written to the drives
> while the bad memory was in place.

Still, it's quite unclear how bad data read from the stats file could
have led to an infinite loop.  The stats file format is pretty "flat"
and AFAICS the worst effect of undetected corruption would be to have
wrong count values for some tables/databases.
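
(If one wanted to eyeball those counters for obviously bogus values,
something along these lines would do; the database name is just a
placeholder:

% psql -d yourdb -c "SELECT relname, seq_scan, idx_scan, n_tup_ins, n_tup_upd, n_tup_del FROM pg_stat_user_tables ORDER BY relname"

Wildly negative or absurdly large numbers there would point at a damaged
stats file.)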

            regards, tom lane

Re: rebellious postgres process

From: "Scott Marlowe"
On Tue, Nov 4, 2008 at 11:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Scott Marlowe" <scott.marlowe@gmail.com> writes:
>> On Tue, Nov 4, 2008 at 8:48 AM, Laszlo Nagy <gandalf@shopzeus.com> wrote:
>>> The server was rebooting intermittently, so we replaced the RAM (we got a
>>> kernel page fault). But it was a week ago. The server is now stable. But is
>>> it possible that somehow the file system became inconsistent, and that is
>>> causing an infinite loop in the stats collector? Just guessing.
>
>> Yes, you really can't trust any data that was written to the drives
>> while the bad memory was in place.
>
> Still, it's quite unclear how bad data read from the stats file could
> have led to an infinite loop.  The stats file format is pretty "flat"
> and AFAICS the worst effect of undetected corruption would be to have
> wrong count values for some tables/databases.

True.  Is it possible some other bit of the data in the system was
corrupted and freaking out the stats collector?

Re: rebellious postgres process

From: Tom Lane
"Scott Marlowe" <scott.marlowe@gmail.com> writes:
> On Tue, Nov 4, 2008 at 11:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Still, it's quite unclear how bad data read from the stats file could
>> have led to an infinite loop.  The stats file format is pretty "flat"
>> and AFAICS the worst effect of undetected corruption would be to have
>> wrong count values for some tables/databases.

> True.  Is it possible some other bit of the data in the system was
> corrupted and freaking out the stats collector?

The collector is pretty self-contained by design, so it's hard to see
what else would affect it.  I'm wondering a bit if the OP's hardware is
still flaky :-(.  In any case it would sure be interesting to see a few
stack traces from the process.
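
A low-effort way to capture a few traces over time (a sketch; PID 904 and
the temp file names are just examples) is to drive gdb from a small command
file:

% printf 'bt\ndetach\nquit\n' > /tmp/bt.cmds
% gdb -x /tmp/bt.cmds -p 904 > /tmp/trace1.txt
% sleep 10
% gdb -x /tmp/bt.cmds -p 904 > /tmp/trace2.txt

If the backtraces all look alike, that is a strong hint about where it is
wedged.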

            regards, tom lane

Re: rebellious postgres process

From: Laszlo Nagy
> The collector is pretty self-contained by design, so it's hard to see
> what else would affect it.  I'm wondering a bit if the OP's hardware is
> still flaky :-(.  In any case it would sure be interesting to see a few
> stack traces from the process.
>
Maybe it is still flaky; yesterday it restarted itself four times. The
collector is working normally now. :-( I'm sorry.