Thread: Context-switch storm in 8.1.15

Context-switch storm in 8.1.15

From
Iñigo Martinez Lasala
Date:
Hi everybody.

Our company was recently awarded a maintenance contract for an online
store.
The website was developed with J2EE, with Postgres 8.1 as the database
backend. The system worked without problems for several months, but with
Christmas, traffic to the web portal has risen sharply.
Under high load the database suffers a performance problem: a large number
of context switches occur, reaching up to 200,000 per second.
The system has 16GB of RAM and 4 Intel Xeon MP CPUs with HT enabled, RAID10
iSCSI storage, and kernel 2.4.21 (RHAS 3).

Half of the CPU power is lost to system time, as you can see.

Vmstat under high load:
19  0      0 281852 150316 13732396    0    0    32    80 1071 128209 41 43 16  0
75  0      0 282040 150316 13732396    0    0     0     0  719 148023 40 38 22  0
 3  0      0 284208 150324 13732412    0    0    16   484  728 145371 39 40 21  0
12  0      0 278364 150324 13732508    0    0    80    56  660 157533 35 42 23  1
 6  0      0 284972 150324 13732580    0    0    32   200  685 142014 39 41 20  0
 8  0      0 296424 150324 13732624    0    0    40   136  554 139601 41 39 20  0
85  0      0 265004 150324 13732664    0    0    32    48  642 142437 48 32 20  0
32  0      0 267432 150324 13732680    0    0     0   788 1003 144409 37 42 21  0
13  0      0 270468 150324 13732676    0    0     0    24  724 146663 42 40 19

Vmstat 20 seconds after stopping the portal:
 8  0      0 962388 206744 13771548    0    0     0     0  131 199784 11 38 51  0
 3  0      0 970212 206744 13771548    0    0     0  1856  305 203639 12 40 48  0
10  0      0 975036 206744 13771588    0    0     0   128  212 201899 11 36 52  0
 3  0      0 970272 206744 13771652    0    0    16   232  685 202672 14 41 44  0
 6  0      0 1008320 206744 13771656    0    0     0    40  198 196298 14 46 39  0
 3  0      0 1034836 206744 13771656    0    0     0     0  147 202731 12 39 50  0
 3  0      0 1037764 206752 13771656    0    0     0   952  202 202933 11 39 50  0
 5  0      0 1078132 206752 13771656    0    0     0     0  154 203408 18 35 47  0
 6  0      0 1110572 206752 13771656    0    0     0     0  153 196864 18 41 41  0
 4  0      0 1105440 206752 13771824    0    0    16   592  461 207538 12 37 51  1
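
For anyone who wants to compare the two captures numerically, here is a
minimal Python sketch that averages the context-switch and system-time
columns. The column names are an assumption: the pasted output has no header,
so the usual procps vmstat layout (r b swpd free buff cache si so bi bo in cs
us sy id wa) is assumed. The sample rows are the first three high-load rows
above, each rejoined onto one line.

```python
# Assumed column order (standard procps vmstat layout; the pasted output
# carries no header, so this mapping is an assumption).
COLUMNS = ["r", "b", "swpd", "free", "buff", "cache", "si", "so",
           "bi", "bo", "in", "cs", "us", "sy", "id", "wa"]


def parse_vmstat(text):
    """Turn whitespace-separated vmstat rows into dicts keyed by column name."""
    samples = []
    for line in text.strip().splitlines():
        fields = line.split()
        # Skip headers and anything that is not a numeric sample row.
        # 15 fields are accepted so a row with a cut-off "wa" still parses.
        if len(fields) < 15 or not all(f.isdigit() for f in fields):
            continue
        samples.append(dict(zip(COLUMNS, map(int, fields))))
    return samples


def mean(samples, column):
    """Average one column over all samples that contain it."""
    values = [s[column] for s in samples if column in s]
    return sum(values) / len(values)


# First three high-load rows from the post, one sample per line.
HIGH_LOAD = """
19  0      0 281852 150316 13732396    0    0    32    80 1071 128209 41 43 16  0
75  0      0 282040 150316 13732396    0    0     0     0  719 148023 40 38 22  0
 3  0      0 284208 150324 13732412    0    0    16   484  728 145371 39 40 21  0
"""

if __name__ == "__main__":
    samples = parse_vmstat(HIGH_LOAD)
    print("mean context switches/s:", mean(samples, "cs"))
    print("mean system time %:", mean(samples, "sy"))
```

Feeding it the second capture in the same way would show the cs column
staying near 200,000 even after the portal was stopped.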


I've read that this problem affects versions prior to 8.2. However,
migrating to 8.2 is not possible at the moment: there are many stored
procedures, and we don't have enough time to test ALL of them before
migrating to 8.2 (or 8.3).
However, we have run light tests with 8.2 under high load, and there the
problem is solved or mitigated.

Now the question: is there a backported patch for 8.1 that solves the
context-switch storm?

The patch I'm looking for is this one or a similar one (this one is for 8.2):
---
A Itagaki Takahiro/Tom Lane patch which arranges for GetSnapshotData
  to copy live-subtransaction XIDs from the PGPROC array into
  snapshots, and use this information to avoid visits to pg_subtrans
  in HeapTupleSatisfiesSnapshot.  This appears to solve the
  pg_subtrans-related context swap storm problem that's been reported
  by several people for 8.1.  While at it, modify GetSnapshotData to
  not take an exclusive lock on ProcArrayLock, as closer analysis
  shows that shared lock is always sufficient.
---

Thanks in advance.


Re: Context-switch storm in 8.1.15

From
"Scott Marlowe"
Date:
On Tue, Dec 30, 2008 at 4:02 AM, Iñigo Martinez Lasala
<imartinez@vectorsf.com> wrote:
> Hi everybody.
>
> Our company was recently awarded a maintenance contract for an online
> store.
> The website was developed with J2EE, with Postgres 8.1 as the database
> backend. The system worked without problems for several months, but with
> Christmas, traffic to the web portal has risen sharply.
> Under high load the database suffers a performance problem: a large number
> of context switches occur, reaching up to 200,000 per second.
> The system has 16GB of RAM and 4 Intel Xeon MP CPUs with HT enabled, RAID10
> iSCSI storage, and kernel 2.4.21 (RHAS 3).
>
> Half of the CPU power is lost to system time, as you can see.
>
> Vmstat under high load:
> 19  0      0 281852 150316 13732396    0    0    32    80 1071 128209 41 43 16  0
> 75  0      0 282040 150316 13732396    0    0     0     0  719 148023 40 38 22  0
>  3  0      0 284208 150324 13732412    0    0    16   484  728 145371 39 40 21  0
> 12  0      0 278364 150324 13732508    0    0    80    56  660 157533 35 42 23  1
>  6  0      0 284972 150324 13732580    0    0    32   200  685 142014 39 41 20  0
>  8  0      0 296424 150324 13732624    0    0    40   136  554 139601 41 39 20  0
> 85  0      0 265004 150324 13732664    0    0    32    48  642 142437 48 32 20  0
> 32  0      0 267432 150324 13732680    0    0     0   788 1003 144409 37 42 21  0
> 13  0      0 270468 150324 13732676    0    0     0    24  724 146663 42 40 19
>
> Vmstat 20 seconds after stopping the portal:
>  8  0      0 962388 206744 13771548    0    0     0     0  131 199784 11 38 51  0
>  3  0      0 970212 206744 13771548    0    0     0  1856  305 203639 12 40 48  0
> 10  0      0 975036 206744 13771588    0    0     0   128  212 201899 11 36 52  0
>  3  0      0 970272 206744 13771652    0    0    16   232  685 202672 14 41 44  0
>  6  0      0 1008320 206744 13771656    0    0     0    40  198 196298 14 46 39  0
>  3  0      0 1034836 206744 13771656    0    0     0     0  147 202731 12 39 50  0
>  3  0      0 1037764 206752 13771656    0    0     0   952  202 202933 11 39 50  0
>  5  0      0 1078132 206752 13771656    0    0     0     0  154 203408 18 35 47  0
>  6  0      0 1110572 206752 13771656    0    0     0     0  153 196864 18 41 41  0
>  4  0      0 1105440 206752 13771824    0    0    16   592  461 207538 12 37 51  1
>
>
> I've read that this problem affects versions prior to 8.2. However,
> migrating to 8.2 is not possible at the moment: there are many stored
> procedures, and we don't have enough time to test ALL of them before
> migrating to 8.2 (or 8.3).
> However, we have run light tests with 8.2 under high load, and there the
> problem is solved or mitigated.

Are you using connection pooling, or do you have a whole bunch of
connections at once?  How many connections do you have that are idle
versus active?
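
One rough way to get those numbers is to read current_query from
pg_stat_activity and bucket the results, as in the Python sketch below. It
assumes stats_command_string is on (otherwise current_query carries no useful
text) and that idle backends report the markers '<IDLE>' and
'<IDLE> in transaction'. The sample rows at the bottom are made up for
illustration, not taken from the poster's system.

```python
from collections import Counter

# Query to run on the server with whatever client you prefer. The
# current_query column is only informative if stats_command_string = on
# (an assumption about the server configuration).
ACTIVITY_SQL = "SELECT current_query FROM pg_stat_activity"


def classify(current_query):
    """Bucket one current_query value from pg_stat_activity."""
    q = (current_query or "").strip()
    if not q:
        return "unknown"  # NULL or empty: no command string available
    if q == "<IDLE>":
        return "idle"
    if q == "<IDLE> in transaction":
        return "idle in transaction"
    # Anything else is counted as a running statement.
    return "active"


def summarize(queries):
    """Count backends per state from an iterable of current_query values."""
    return Counter(classify(q) for q in queries)


if __name__ == "__main__":
    # Hypothetical rows, for illustration only.
    rows = ["<IDLE>", "<IDLE>", "SELECT 1", "<IDLE> in transaction", None]
    for state, count in sorted(summarize(rows).items()):
        print(state, count)
```

A large "idle" count next to a small "active" one would suggest the
application holds many open connections and could benefit from a pooler.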

> Now the question: is there a backported patch for 8.1 that solves the
> context-switch storm?

It's far more likely that a backported 8.1.x would have problems than
that you'd run into issues with stored procs on 8.2 or 8.3.  I'd skip 8.2
and go straight to testing on 8.3.  We upgraded from 8.1 to 8.3 on our
production database.  The only issue we had was that a lot of implicit
casts had been removed, and some older code relied on an implicit date
:: text cast that had gone away.  Since relying on date being a text
string is bad form anyway, we fixed the code and went on from there.

Usually when something like this doesn't get back-patched, it's
because the code base was so different in that area that backporting
it would pose a real danger to code stability.

If you upgrade to 8.3 you're upgrading to a stable release that solves
your problems.  If you backport the patch to 8.1 you're running a
version tested only by you.