Thread: rebellious postgres process
This is what I see in "top" for the past two days:

last pid: 95702;  load averages:  2.48,  2.69,  2.75   up 3+02:08:49  10:27:58
257 processes: 3 running, 246 sleeping, 8 zombie
CPU states: 15.5% user,  0.0% nice, 22.1% system,  1.3% interrupt, 61.1% idle
Mem: 864M Active, 6028M Inact, 512M Wired, 254M Cache, 214M Buf, 211M Free
Swap: 10G Total, 500K Used, 10G Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
  677 pgsql       1 107    0 22396K  5460K CPU3   6  52.4H 99.02% postgres

The process has been using 100% CPU for 52 hours! Here it is:

% ps axl | grep 677
   70   677   666 757 107  0 22396  5460 select Rs   ??  3144:50.88 postgres: stats collector process   (postgres)

Is this normal? Fortunately, we have multiple processors in this system, so it is not causing a slowdown, but I really need to know what is happening here. I believe this is not the way it should work. Yesterday the system restarted unexpectedly, and this process might be the reason. After the restart, the process is doing the same thing again:

  PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
  904 pgsql       1 105    0 22396K  5332K CPU3   3 386:05 99.22% postgres

Here is some more info:

Hardware: two quad-core Xeon 2.5GHz processors, 8GB RAM, RAID controller with 2GB cache and BBU, 10 SATA2 disks in RAID 1+0
OS: FreeBSD 7.0 amd64
PostgreSQL: 8.3
Databases: two bigger ones; total database size on disk is about 10GB, used mostly read-only, but some smaller tables are updated frequently.

I don't think it needs to collect stats continuously.

Thanks,

Laszlo
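One quick way to see what a runaway process like this is doing at the system-call level on FreeBSD (not something tried in the mail above; the PID is the one from the top output) is to attach truss to it for a few seconds:

% truss -p 677        # prints each system call as it happens; Ctrl-C to stop

If the process is spinning in user space it will make few or no system calls, which is a useful data point in itself.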
Laszlo Nagy <gandalf@shopzeus.com> writes:
> The process has been using 100% CPU for 52 hours! Here it is:

> % ps axl | grep 677
>    70   677   666 757 107  0 22396  5460 select Rs   ??  3144:50.88 postgres: stats collector process   (postgres)

Huh, that's weird.  We've fixed some bugs in the past that led the stats
collector to consume excessive CPU --- but that was all pre-8.3.

Do you have a whole lot of tables in this database?  (Or even more
directly: how large is $PGDATA/global/pgstat.stat?)

			regards, tom lane
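Neither command appears in Tom's mail, but his two questions can be checked with something along these lines ("yourdb" is a placeholder database name, and $PGDATA is assumed to point at the cluster's data directory):

% ls -l $PGDATA/global/pgstat.stat
% psql -d yourdb -c "SELECT count(*) FROM pg_stat_all_tables;"

The second command counts the rows in the per-table statistics view for that database, which is a rough measure of how much the stats collector has to track.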
Tom Lane wrote:
> Laszlo Nagy <gandalf@shopzeus.com> writes:
>> The process has been using 100% CPU for 52 hours! Here it is:
>> % ps axl | grep 677
>>    70   677   666 757 107  0 22396  5460 select Rs   ??  3144:50.88 postgres: stats collector process   (postgres)
>
> Huh, that's weird.  We've fixed some bugs in the past that led the stats
> collector to consume excessive CPU --- but that was all pre-8.3.

My version is 8.3.3, compiled from FreeBSD ports. I just updated my ports tree, and that is the most up-to-date officially (sup)ported version. The process is still running:

  PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
  904 pgsql       1 105    0 22396K  5428K CPU1   1 856:13 99.02% postgres

> Do you have a whole lot of tables in this database?

One database has 140 tables, the other has 62. (There are two other database instances, but they are rarely used.)

> (Or even more directly: how large is $PGDATA/global/pgstat.stat?)

-rw-------  1 pgsql  pgsql  54536 Nov  3 20:11 pgstat.stat

Thanks,

Laszlo
Tom Lane wrote:
> Huh, that's weird.  We've fixed some bugs in the past that led the stats
> collector to consume excessive CPU --- but that was all pre-8.3.

The server was rebooting intermittently, so we replaced the RAM (we got a kernel page fault). But that was a week ago, and the server is now stable. Is it possible that somehow the file system became inconsistent, and that is causing an infinite loop in the stats collector? Just guessing.
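Nobody in the thread actually suggests this, but if a corrupt stats file were the suspect, one way to rule it out (assuming $PGDATA points at the data directory) would be to move the file aside while the server is stopped; in 8.3 the collector simply starts over with empty statistics when the file is missing:

% pg_ctl -D $PGDATA stop
% mv $PGDATA/global/pgstat.stat $PGDATA/global/pgstat.stat.old
% pg_ctl -D $PGDATA start

The cost is that the accumulated table and index usage counters are reset, which also leaves autovacuum blind to past activity until new counts accumulate.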
Laszlo Nagy <gandalf@shopzeus.com> writes:
> The process is still running:

Could you attach to it with gdb and see what it's doing?

>> Do you have a whole lot of tables in this database?
> One database has 140 tables, the other has 62. (There are two other
> database instances, but they are rarely used.)

>> (Or even more directly: how large is $PGDATA/global/pgstat.stat?)
> -rw-------  1 pgsql  pgsql  54536 Nov  3 20:11 pgstat.stat

Well, that lets out the theory that it just has a whole lot of stats to
keep track of ... although it's quite interesting that the file timestamp
isn't current.  Somehow it's evidently gotten wedged in a way that
prevents it from updating the file.

			regards, tom lane
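A rough sketch of what that looks like (the binary path below is the FreeBSD ports default and may differ elsewhere; 904 is the PID from the top output):

% gdb /usr/local/bin/postgres 904
(gdb) bt          # print the current stack trace
(gdb) cont        # let it run for a bit, then hit Ctrl-C and run bt again
(gdb) detach
(gdb) quit

Attaching pauses the process briefly, but it resumes normally after detach.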
On Tue, Nov 4, 2008 at 8:48 AM, Laszlo Nagy <gandalf@shopzeus.com> wrote:
> Tom Lane wrote:
>> Huh, that's weird.  We've fixed some bugs in the past that led the stats
>> collector to consume excessive CPU --- but that was all pre-8.3.
>
> The server was rebooting intermittently, so we replaced the RAM (we got a
> kernel page fault). But that was a week ago, and the server is now stable.
> Is it possible that somehow the file system became inconsistent, and that
> is causing an infinite loop in the stats collector? Just guessing.

Yes, you really can't trust any data that was written to the drives
while the bad memory was in place.
"Scott Marlowe" <scott.marlowe@gmail.com> writes: > On Tue, Nov 4, 2008 at 8:48 AM, Laszlo Nagy <gandalf@shopzeus.com> wrote: >> The server was rebooting intermittently, so we replaced the RAM (we got a >> kernel page fault). But it was a week ago. The server is now stable. But is >> it possible that somehow the file system became inconsistent, and that is >> causing an infinite loop in the stats collector? Just guessing. > Yes, you really can't trust any data that was written to the drives > while the bad memory was in place. Still, it's quite unclear how bad data read from the stats file could have led to an infinite loop. The stats file format is pretty "flat" and AFAICS the worst effect of undetected corruption would be to have wrong count values for some tables/databases. regards, tom lane
On Tue, Nov 4, 2008 at 11:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Scott Marlowe" <scott.marlowe@gmail.com> writes:
>> On Tue, Nov 4, 2008 at 8:48 AM, Laszlo Nagy <gandalf@shopzeus.com> wrote:
>>> The server was rebooting intermittently, so we replaced the RAM (we got a
>>> kernel page fault). But that was a week ago, and the server is now stable.
>>> Is it possible that somehow the file system became inconsistent, and that
>>> is causing an infinite loop in the stats collector? Just guessing.
>
>> Yes, you really can't trust any data that was written to the drives
>> while the bad memory was in place.
>
> Still, it's quite unclear how bad data read from the stats file could
> have led to an infinite loop.  The stats file format is pretty "flat"
> and AFAICS the worst effect of undetected corruption would be to have
> wrong count values for some tables/databases.

True. Is it possible some other bit of data in the system was corrupted
and is freaking out the stats collector?
"Scott Marlowe" <scott.marlowe@gmail.com> writes: > On Tue, Nov 4, 2008 at 11:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Still, it's quite unclear how bad data read from the stats file could >> have led to an infinite loop. The stats file format is pretty "flat" >> and AFAICS the worst effect of undetected corruption would be to have >> wrong count values for some tables/databases. > True. Is it possible some other bit of the data in the system was > corrupted and freaking out the stats collector? The collector is pretty self-contained by design, so it's hard to see what else would affect it. I'm wondering a bit if the OP's hardware is still flaky :-(. In any case it would sure be interesting to see a few stack traces from the process. regards, tom lane
Tom Lane wrote:
> The collector is pretty self-contained by design, so it's hard to see
> what else would affect it.  I'm wondering a bit if the OP's hardware is
> still flaky :-(.  In any case it would sure be interesting to see a few
> stack traces from the process.

Maybe it is still flaky, because yesterday it restarted itself 4 times. Now the collector is working normally. :-( I'm sorry.