Thread: Strange behaviour under heavy load
Hello.
I have found that a heavily loaded PostgreSQL server sometimes begins to run all queries slower than usual, sometimes 5 or more times slower. I cannot reproduce it on demand, but the symptoms are as follows: queries run very fast for 5-10 minutes, then there is a significant slowdown (every query, even a simple one, runs 5-10 or even more times slower), and disk write activity grows too (though not every time; I don't know exactly whether there is a direct correlation). After 2-3 seconds, performance returns to normal.
Autovacuum is turned off.
The machine has 2 processors, plenty of memory, and fast SCSI disks.
I understand this is not much information to go on. Please advise what to monitor to pin the problem down.
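Since the question is what to monitor: a minimal sketch, assuming standard Linux tools and 8.x-era settings (the thresholds here are illustrative, not from this thread):

    vmstat 1          # watch "bo" (blocks written out) and "wa" (I/O wait) during a stall
    iostat -x 1       # per-device write throughput and utilization

    # in postgresql.conf, to make the slow periods visible in the server log:
    log_min_duration_statement = 1000   # log every statement taking longer than 1s
    checkpoint_warning = 30             # warn when checkpoints come too close together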
Hmm, looks like a checkpoint? Check the checkpoint settings in postgresql.conf.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
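If editing postgresql.conf is inconvenient, the current values can also be read straight from psql (standard SHOW commands):

    SHOW checkpoint_segments;
    SHOW checkpoint_timeout;
    SHOW checkpoint_warning;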
New information.
When I run CHECKPOINT manually from a psql console, the effect is fully reproducible.
So it is a checkpoint slowdown.
The question is: how can I make CHECKPOINT run faster?
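One way to measure this directly from psql, assuming superuser rights (CHECKPOINT requires them):

    \timing
    CHECKPOINT;   -- the elapsed time should roughly match the observed stall
    SELECT 1;     -- run a trivial query right after to see any residual slowdown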
On 3/27/07, Dmitry Koterov <dmitry@koterov.ru> wrote:
> How to tune them?
> Now I have:
>
> checkpoint_segments = 5     # in logfile segments, min 1, 16MB each
> checkpoint_timeout = 900    # range 30-3600, in seconds
> checkpoint_warning = 30     # in seconds, 0 is off
>
> No checkpoint warnings in the pgsql logs.
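With only 5 WAL segments, checkpoints can fire quite often under heavy write load even without triggering checkpoint_warning. A sketch of one 8.1/8.2-era approach (the values are illustrative, not tested): raise checkpoint_segments so checkpoints are less frequent, and let the background writer trickle dirty buffers to disk between them so each checkpoint has less to flush at once:

    checkpoint_segments = 16      # fewer, less frequent checkpoints
    bgwriter_delay = 200          # ms between background-writer rounds
    bgwriter_all_percent = 1.0    # percent of the buffer pool scanned per round
    bgwriter_all_maxpages = 200   # cap on dirty pages written per round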
The disk write activity timeline looks like this:
CHECKPOINT ->
large disk write (30-40 MB/s) for about 1 second ->
4-5 seconds: no read or write activity at all, but the server runs slow ->
2-3 seconds: disk write activity of about 4-5 MB/s ->
normal speed restored
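Rough arithmetic on that burst, taking the reported numbers at face value: 30-40 MB/s for about one second is 30-40 MB of dirty data, or roughly 4,000-5,000 8 KB buffers flushed in one go. A stall of several seconds while that write is forced to disk (and fsync'ed) is consistent with a checkpoint of a moderately dirty buffer pool.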
Thread: unexpected data beyond EOF

I'm doing dry runs to migrate my database from 8.1.3 to 8.2.3, with pg_dumpall | pg_restore, using the executables from 8.2.3. I'm seeing these messages:

ERROR: unexpected data beyond EOF in block 4506 of relation "coverage_test_val"
HINT: This has been seen to occur with buggy kernels; consider updating your system.
CONTEXT: COPY coverage_test_val, line 378036: "533359 61566 2007-02-01 DELINQUENCY6 16 0.356509255634 0.5 7"

ERROR: unexpected data beyond EOF in block 414 of relation "rating"
HINT: This has been seen to occur with buggy kernels; consider updating your system.
CONTEXT: COPY rating, line 56387: "840942 19548 2005-10-25 1 25 \N 1 \N"

The migration otherwise finishes successfully, but the two tables with the EOF problem end up with count(*) = 0.

Can this really be a buggy kernel? Is there a list of known Linux releases that are buggy? Of course, upgrading the kernel on a production machine is the last thing I want to try...

And here's a theory: a while ago, someone installed PL/R on the database to do statistics. Executing some PL/R functions would crash all the connected clients, and I'd see "unexpected EOF" messages. I immediately said we needed PL/R uninstalled, and the crashes stopped. The encoding of my database is SQL_ASCII, which from what I've been reading is a non-encoding and performs no validation. Could invalid characters have crept into tables during those PL/R crashes? Would porting to UTF8 fix this?

Is there a way I can fix those two cases of unexpected EOF so I can at least perform the upgrade?

Thank you,
Jaime
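A hedged way to test the SQL_ASCII theory before attempting a conversion, assuming GNU grep and iconv and a plain-SQL dump file (the filename is illustrative):

    LC_ALL=C grep -c -P '[\x80-\xFF]' dump.sql    # count lines containing high-bit (non-ASCII) bytes
    iconv -f UTF-8 -t UTF-8 dump.sql >/dev/null   # fails at the first sequence that is not valid UTF-8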
Jaime Silvela <JSilvela@Bear.com> writes:
> I'm seeing these messages
> ERROR: unexpected data beyond EOF in block 4506 of relation "coverage_test_val"
> HINT: This has been seen to occur with buggy kernels; consider updating your system.

So what kernel are you running, on what hardware? Is the problem reproducible if you drop the target database and start over? The known cases of this problem have been under heavy concurrent update load (multiple sessions trying to extend a table concurrently). I don't entirely see how you could get it if the COPY is the only thing writing that table ...

regards, tom lane
The kernel is Linux 2.6.5. I'll drop and redo this evening after hours; the migration takes a couple of hours.

The source database is under medium load, with a couple of heavy updates, but the target database is a clean 8.2.3 install and is only being written to by pg_restore. So I take it that this happens on the restore part, not the dump part, then?

Thanks,
Jaime
Jaime Silvela <JSilvela@bear.com> writes:
> The kernel is Linux 2.6.5

2.6.5.what (give us full uname -a output please)? On what hardware? If memory serves, we had some reason to think that it only occurred on specific 64-bit multi-CPU platforms, so I'm not just asking idly. The error check was put in as a consequence of this thread:
http://archives.postgresql.org/pgsql-admin/2006-09/msg00250.php

> So I take it that this happens on the restore part, not the dump part then?

It's definitely a write failure, not a read failure. Per the code comments:

 * We get here only in the corner case where we are trying to extend
 * the relation but we found a pre-existing buffer marked BM_VALID.
 * This can happen because mdread doesn't complain about reads beyond
 * EOF --- which is arguably bogus, but changing it seems tricky ---
 * and so a previous attempt to read a block just beyond EOF could
 * have left a "valid" zero-filled buffer. Unfortunately, we have
 * also seen this case occurring because of buggy Linux kernels that
 * sometimes return an lseek(SEEK_END) result that doesn't account for
 * a recent write. In that situation, the pre-existing buffer would
 * contain valid data that we don't want to overwrite. Since the
 * legitimate cases should always have left a zero-filled buffer,
 * complain if not PageIsNew.

regards, tom lane
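A small C sketch (not PostgreSQL source) of the failure mode that comment describes: the server derives a segment's length from lseek(SEEK_END), so a kernel that returns a stale end-of-file after a recent write makes the server believe the file is shorter than it really is, and it then tries to "extend" into a block that already holds valid data, which is exactly what the new error check catches:

    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ 8192                         /* PostgreSQL's default block size */

    /* Number of blocks currently in an open segment file. On a buggy
     * kernel, lseek(SEEK_END) may not yet account for a concurrent
     * write, so this can under-report the true length. */
    off_t blocks_in_segment(int fd)
    {
        return lseek(fd, 0, SEEK_END) / BLCKSZ;
    }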
Sorry, forgot about the hardware. Here's the uname -a:

Linux wdlbc22r06 2.6.5-7.244-bigsmp #1 SMP Mon Dec 12 18:32:25 UTC 2005 i686 i686 i386 GNU/Linux

From hwinfo --cpu:

01: None 00.0: 10103 CPU
  [Created at cpu.290]
  Unique ID: rdCR.j8NaKXDZtZ6
  Hardware Class: cpu
  Arch: Intel
  Vendor: "GenuineIntel"
  Model: 15.4.3 "Intel(R) Xeon(TM) CPU 3.60GHz"
  Features: fpu,vme,de,pse,tsc,msr,pae,mce,cx8,apic,sep,mtrr,pge,mca,cmov,pat,pse36,clflush,dts,acpi,mmx,fxsr,sse,sse2,ss,ht,tm,pbe,lm,pni,monitor,ds_cpl,tm2,est,cid
  Clock: 3600 MHz
  Cache: 2048 kb
  Units/Processor: 2
  Config Status: cfg=new, avail=yes, need=no, active=unknown

From hwinfo --memory:

01: None 00.0: 10102 Main Memory
  [Created at memory.59]
  Unique ID: rdCR.CxwsZFjVASF
  Hardware Class: memory
  Model: "Main Memory"
  Memory Range: 0x00000000-0xfff5ba3f (rw)
  Memory Size: 4 GB
  Config Status: cfg=no, avail=yes, need=no, active=unknown

We're running in 32-bit mode for compatibility with some libraries. How can I determine whether this is due to a buggy kernel?

Jaime
Jaime Silvela <JSilvela@bear.com> writes:
> Here's the uname -a:
> Linux wdlbc22r06 2.6.5-7.244-bigsmp #1 SMP Mon Dec 12 18:32:25 UTC 2005 i686 i686 i386 GNU/Linux

The previous thread says:

>> SLES support helped upgrade our SLES 9 SP3 Linux kernel from 2.6.5-7.244
>> to 2.6.5-7.282. Since then we haven't had any blocks of rows being
>> rewritten or blanked out by the kernel. The new kernel is handling the
>> writes much better.

While I can't say that his 2.6.5-7.244 is the same as yours, it does seem a bit too close for comfort.

regards, tom lane
Wow, thanks! I'll definitely get the Unix operations guys to investigate. Do you know whether this is tracked as an issue for Linux? I'd like some reassurance that whatever release Unix Ops is able to upgrade to actually fixes the bug.

I'll try to get one of our machines upgraded and test there, and will let you know if I find anything on my part.

Thanks again,
Jaime
Jaime Silvela <JSilvela@bear.com> writes:
> Do you know if this is tracked as an issue for Linux? I'd like some
> reassurance that whatever release Unix Ops is able to upgrade to fixes
> the bug.

I did some idle searching when we were looking at the previous report but couldn't find any specific kernel patches about this. It's presumably known on the kernel side, but maybe not under the specific "lseek reports stale EOF position" guise that we know it as. If you manage to track it down, please do post back here, because we'd all like to know.

regards, tom lane