Thread: Strange behaviour under heavy load
Hello.
I have found that a heavily loaded PostgreSQL server sometimes begins to run all queries slower than usual, sometimes 5 or more times slower. I cannot reproduce it on demand, but the symptoms are as follows: queries run very fast for 5-10 minutes, then there is a significant slowdown (every query, even a simple one, runs 5-10 or even more times slower), and disk write activity grows too (though not every time; I don't know exactly whether there is a direct correlation). After 2-3 seconds, performance returns to normal.
Autovacuum is turned off.
The machine has 2 processors, plenty of memory, and fast SCSI disks.
I understand this is not much information to go on. Please advise what to monitor to pin the problem down.
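Since the question is what to monitor: a minimal sketch, assuming standard Linux tools and 8.x-era settings (the thresholds here are illustrative, not from this thread):

    vmstat 1          # watch "bo" (blocks written out) and "wa" (I/O wait) during a stall
    iostat -x 1       # per-device write throughput and utilization

    # in postgresql.conf, to make the slow periods visible in the server log:
    log_min_duration_statement = 1000   # log every statement taking longer than 1s
    checkpoint_warning = 30             # warn when checkpoints come too close together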
Hmm, looks like a checkpoint? Check the checkpoint settings in postgresql.conf.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
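If editing postgresql.conf is inconvenient, the current values can also be read straight from psql (standard SHOW commands):

    SHOW checkpoint_segments;
    SHOW checkpoint_timeout;
    SHOW checkpoint_warning;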
New information.
When I run CHECKPOINT manually from a psql console, the effect is fully reproducible.
So it is a checkpoint slowdown.
The question is: how can I make CHECKPOINT run faster?
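One way to measure this directly from psql, assuming superuser rights (CHECKPOINT requires them):

    \timing
    CHECKPOINT;   -- the elapsed time should roughly match the observed stall
    SELECT 1;     -- run a trivial query right after to see any residual slowdown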
On 3/27/07, Dmitry Koterov <dmitry@koterov.ru> wrote:
> How to tune them?
> Now I have:
>
> checkpoint_segments = 5     # in logfile segments, min 1, 16MB each
> checkpoint_timeout = 900    # range 30-3600, in seconds
> checkpoint_warning = 30     # in seconds, 0 is off
>
> No checkpoint warnings in the pgsql logs.
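With only 5 WAL segments, checkpoints can fire quite often under heavy write load even without triggering checkpoint_warning. A sketch of one 8.1/8.2-era approach (the values are illustrative, not tested): raise checkpoint_segments so checkpoints are less frequent, and let the background writer trickle dirty buffers to disk between them so each checkpoint has less to flush at once:

    checkpoint_segments = 16      # fewer, less frequent checkpoints
    bgwriter_delay = 200          # ms between background-writer rounds
    bgwriter_all_percent = 1.0    # percent of the buffer pool scanned per round
    bgwriter_all_maxpages = 200   # cap on dirty pages written per round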
The disk write activity timeline looks like this:
CHECKPOINT ->
large disk write (30-40 MB/s) for about 1 second ->
4-5 seconds: no read or write activity at all, but the server runs slow ->
2-3 seconds: disk write activity of about 4-5 MB/s ->
normal speed restored
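Rough arithmetic on that burst, taking the reported numbers at face value: 30-40 MB/s for about one second is 30-40 MB of dirty data, or roughly 4,000-5,000 8 KB buffers flushed in one go. A stall of several seconds while that write is forced to disk (and fsync'ed) is consistent with a checkpoint of a moderately dirty buffer pool.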
Thread: unexpected data beyond EOF

I'm doing dry runs to migrate my database from 8.1.3 to 8.2.3, with pg_dumpall | pg_restore, using the executables from 8.2.3. I'm seeing these messages:

ERROR: unexpected data beyond EOF in block 4506 of relation "coverage_test_val"
HINT: This has been seen to occur with buggy kernels; consider updating your system.
CONTEXT: COPY coverage_test_val, line 378036: "533359 61566 2007-02-01 DELINQUENCY6 16 0.356509255634 0.5 7"

ERROR: unexpected data beyond EOF in block 414 of relation "rating"
HINT: This has been seen to occur with buggy kernels; consider updating your system.
CONTEXT: COPY rating, line 56387: "840942 19548 2005-10-25 1 25 \N 1 \N"

The migration otherwise finishes successfully, but the two tables with the EOF problem end up with count(*) = 0.

Can this really be a buggy kernel? Is there a list of known Linux releases that are buggy? Of course, upgrading the kernel on a production machine is the last thing I want to try...

And here's a theory: a while ago, someone installed PL/R on the database to do statistics. Executing some PL/R functions would crash all the connected clients, and I'd see "unexpected EOF" messages. I immediately said we needed PL/R uninstalled, and the crashes stopped. The encoding of my database is SQL_ASCII, which from what I've been reading is a non-encoding and performs no validation. Could invalid characters have crept into tables during those PL/R crashes? Would porting to UTF8 fix this?

Is there a way I can fix those two cases of unexpected EOF so I can at least perform the upgrade?

Thank you,
Jaime
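A hedged way to test the SQL_ASCII theory before attempting a conversion, assuming GNU grep and iconv and a plain-SQL dump file (the filename is illustrative):

    LC_ALL=C grep -c -P '[\x80-\xFF]' dump.sql    # count lines containing high-bit (non-ASCII) bytes
    iconv -f UTF-8 -t UTF-8 dump.sql >/dev/null   # fails at the first sequence that is not valid UTF-8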
Jaime Silvela <JSilvela@Bear.com> writes:
> I'm seeing these messages
> ERROR: unexpected data beyond EOF in block 4506 of relation "coverage_test_val"
> HINT: This has been seen to occur with buggy kernels; consider updating your system.

So what kernel are you running, on what hardware? Is the problem reproducible if you drop the target database and start over? The known cases of this problem have been under heavy concurrent update load (multiple sessions trying to extend a table concurrently). I don't entirely see how you could get it if the COPY is the only thing writing that table ...

regards, tom lane
The kernel is Linux 2.6.5. I'll drop and redo this evening after hours; the migration takes a couple of hours.

The source database is under medium load, with a couple of heavy updates, but the target database is a clean 8.2.3 install and is only being written to by pg_restore. So I take it that this happens on the restore part, not the dump part, then?

Thanks,
Jaime
Jaime Silvela <JSilvela@bear.com> writes:
> The kernel is Linux 2.6.5

2.6.5.what (give us full uname -a output please)? On what hardware? If memory serves, we had some reason to think that it only occurred on specific 64-bit multi-CPU platforms, so I'm not just asking idly. The error check was put in as a consequence of this thread:
http://archives.postgresql.org/pgsql-admin/2006-09/msg00250.php

> So I take it that this happens on the restore part, not the dump part then?

It's definitely a write failure, not a read failure. Per the code comments:

 * We get here only in the corner case where we are trying to extend
 * the relation but we found a pre-existing buffer marked BM_VALID.
 * This can happen because mdread doesn't complain about reads beyond
 * EOF --- which is arguably bogus, but changing it seems tricky ---
 * and so a previous attempt to read a block just beyond EOF could
 * have left a "valid" zero-filled buffer. Unfortunately, we have
 * also seen this case occurring because of buggy Linux kernels that
 * sometimes return an lseek(SEEK_END) result that doesn't account for
 * a recent write. In that situation, the pre-existing buffer would
 * contain valid data that we don't want to overwrite. Since the
 * legitimate cases should always have left a zero-filled buffer,
 * complain if not PageIsNew.

regards, tom lane
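A small C sketch (not PostgreSQL source) of the failure mode that comment describes: the server derives a segment's length from lseek(SEEK_END), so a kernel that returns a stale end-of-file after a recent write makes the server believe the file is shorter than it really is, and it then tries to "extend" into a block that already holds valid data, which is exactly what the new error check catches:

    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ 8192                         /* PostgreSQL's default block size */

    /* Number of blocks currently in an open segment file. On a buggy
     * kernel, lseek(SEEK_END) may not yet account for a concurrent
     * write, so this can under-report the true length. */
    off_t blocks_in_segment(int fd)
    {
        return lseek(fd, 0, SEEK_END) / BLCKSZ;
    }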
Sorry, forgot about the hardware. Here's the uname -a:

Linux wdlbc22r06 2.6.5-7.244-bigsmp #1 SMP Mon Dec 12 18:32:25 UTC 2005 i686 i686 i386 GNU/Linux

From hwinfo --cpu:

01: None 00.0: 10103 CPU
  [Created at cpu.290]
  Unique ID: rdCR.j8NaKXDZtZ6
  Hardware Class: cpu
  Arch: Intel
  Vendor: "GenuineIntel"
  Model: 15.4.3 "Intel(R) Xeon(TM) CPU 3.60GHz"
  Features: fpu,vme,de,pse,tsc,msr,pae,mce,cx8,apic,sep,mtrr,pge,mca,cmov,pat,pse36,clflush,dts,acpi,mmx,fxsr,sse,sse2,ss,ht,tm,pbe,lm,pni,monitor,ds_cpl,tm2,est,cid
  Clock: 3600 MHz
  Cache: 2048 kb
  Units/Processor: 2
  Config Status: cfg=new, avail=yes, need=no, active=unknown

From hwinfo --memory:

01: None 00.0: 10102 Main Memory
  [Created at memory.59]
  Unique ID: rdCR.CxwsZFjVASF
  Hardware Class: memory
  Model: "Main Memory"
  Memory Range: 0x00000000-0xfff5ba3f (rw)
  Memory Size: 4 GB
  Config Status: cfg=no, avail=yes, need=no, active=unknown

We're running in 32-bit mode for compatibility with some libraries. How can I determine whether this is due to a buggy kernel?

Jaime
Jaime Silvela <JSilvela@bear.com> writes:
> Here's the uname -a:
> Linux wdlbc22r06 2.6.5-7.244-bigsmp #1 SMP Mon Dec 12 18:32:25 UTC 2005 i686 i686 i386 GNU/Linux

The previous thread says:

>> SLES support helped upgrade our SLES 9 SP3 Linux kernel from 2.6.5-7.244
>> to 2.6.5-7.282. Since then we haven't had any blocks of rows being
>> rewritten or blanked out by the kernel. The new kernel is handling the
>> writes much better.

While I can't say that his 2.6.5-7.244 is the same as yours, it does seem a bit too close for comfort.

regards, tom lane
Wow, thanks! I'll definitely get the Unix operations guys to investigate. Do you know whether this is tracked as an issue for Linux? I'd like some reassurance that whatever release Unix Ops is able to upgrade to actually fixes the bug.

I'll try to get one of our machines upgraded and test there, and will let you know if I find anything on my part.

Thanks again,
Jaime
Jaime Silvela <JSilvela@bear.com> writes:
> Do you know if this is tracked as an issue for Linux? I'd like some
> reassurance that whatever release Unix Ops is able to upgrade to fixes
> the bug.

I did some idle searching when we were looking at the previous report but couldn't find any specific kernel patches about this. It's presumably known on the kernel side, but maybe not under the specific "lseek reports stale EOF position" guise that we know it as. If you manage to track it down, please do post back here, because we'd all like to know.

regards, tom lane