Re: RFC: replace pg_stat_activity.waiting with something more descriptive - Mailing list pgsql-hackers

From Vladimir Borodin
Subject Re: RFC: replace pg_stat_activity.waiting with something more descriptive
Date
Msg-id EEE78E40-0E48-411A-9F90-CF9339DA9698@simply.name
In response to Re: RFC: replace pg_stat_activity.waiting with something more descriptive  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: RFC: replace pg_stat_activity.waiting with something more descriptive  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers

On 16 Sept 2015, at 20:52, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 16, 2015 at 12:29 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
Yes, the major question is cost. But I think we should validate our thinking
with experiments, given that there are more possible synchronization protocols.
Ildus posted an implementation of the double-buffering approach that showed
quite low cost.

I'm not sure exactly which email you are referring to, but I don't
believe that anyone has done experiments that are anywhere near
comprehensive enough to convince ourselves that this won't be a
problem.  If a particular benchmark doesn't show an issue, that can
just mean that the benchmark isn't hitting the case where there is a
problem.  For example, EDB has had customers who have severe
contention apparently on the buffer content lwlocks, resulting in big
slowdowns.  You don't see that in, say, a pgbench run.  But for people
who have certain kinds of queries, it's really bad.  Those sorts of
loads, where the lwlock system really gets stressed, are cases where
adding overhead seems likely to pinch.

Alexander and Ildus gave us two patches for REL9_4_STABLE - one with sampling every N milliseconds and one with measuring timings. We have tested both of them on two kinds of our workload, besides pgbench runs. One workload is OLTP where all the data fits in shared buffers, synchronous_commit is off, and the bottleneck is ProcArrayLock (all backtraces look something like the following):

[Thread debugging using libthread_db enabled]
0x00007f10e01d4f27 in semop () from /lib64/libc.so.6
#0  0x00007f10e01d4f27 in semop () from /lib64/libc.so.6
#1  0x000000000061fe27 in PGSemaphoreLock (sema=0x7f11f2d4f430, interruptOK=0 '\000') at pg_sema.c:421
#2  0x00000000006769ba in LWLockAcquireCommon (l=0x7f10e1e00120, mode=LW_EXCLUSIVE) at lwlock.c:626
#3  LWLockAcquire (l=0x7f10e1e00120, mode=LW_EXCLUSIVE) at lwlock.c:467
#4  0x0000000000667862 in ProcArrayEndTransaction (proc=0x7f11f2d4f420, latestXid=182562881) at procarray.c:404
#5  0x00000000004b579b in CommitTransaction () at xact.c:1957
#6  0x00000000004b6ae5 in CommitTransactionCommand () at xact.c:2727
#7  0x00000000006819d9 in finish_xact_command () at postgres.c:2437
#8  0x0000000000684f05 in PostgresMain (argc=<value optimized out>, argv=<value optimized out>, dbname=0x21e1a70 "xivadb", username=<value optimized out>) at postgres.c:4270
#9  0x0000000000632d7d in BackendRun (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4155
#10 BackendStartup (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:3829
#11 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1597
#12 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1244
#13 0x00000000005cadb8 in main (argc=3, argv=0x21e0aa0) at main.c:228


The other is a workload where all the data fits in RAM but does not fit in shared buffers; there the bottleneck is mostly BufFreelistLock, with noticeable contention on buffer-mapping partition locks and buffer locks as well. Taking several GDB backtraces of several backends and doing some bash magic gives the following:

root@pgload01g ~ # grep '^#4 ' /tmp/bt | awk '{print $2, $4, $NF}' | sort | uniq -c | sort -rn
    126 0x000000000065db61 BufferAlloc bufmgr.c:591
     67 0x000000000065e03a BufferAlloc bufmgr.c:760
     43 0x00000000005c8c3b pq_getbyte pqcomm.c:899
     39 0x000000000065dd93 BufferAlloc bufmgr.c:765
      6 0x00000000004b52bb RecordTransactionCommit xact.c:1194
      4 0x000000000065da0e ReadBuffer_common bufmgr.c:476
      1 ReadBuffer_common relpersistence=112 bufmgr.c:340
      1 exec_eval_expr expr=0x166e908, pl_exec.c:4796
      1 0x00007f78b8cb217b ?? /usr/pgsql-9.4/lib/pg_stat_statements.so
      1 0x00000000005d4cbb _copyList copyfuncs.c:3849
root@pgload01g ~ #

For both scenarios on Linux we got approximately the same results - the version with timings was faster than the version with sampling (sampling was done every 10 ms). Vanilla PostgreSQL from REL9_4_STABLE gave ~15500 tps, the version with timings gave ~14500 tps (roughly a 6.5% drop), and the version with sampling gave ~13800 tps (roughly an 11% drop). In all cases the CPU was 100% utilized. Comparing vanilla PostgreSQL and the version with timings on a constant workload (12000 tps) gave the following query latencies:

quantile   vanilla   timing
   99%      79.0      97.0 (+18.0)
   98%      64.0      76.0 (+12.0)
   95%      38.0      47.0 (+9.0)
   90%      16.0      21.0 (+5.0)
   85%       7.0      11.0 (+4.0)
   80%       5.0       7.0 (+2.0)
   75%       4.0       5.0 (+1.0)
   50%       2.0       3.0 (+1.0)

And in that test the version with timings consumed about 7% more CPU. Does that seem like a representative result for a workload where the lwlock system is stressed?

And when the data does not fit in RAM you really don't see much difference between the three versions, because the contention moves from the lwlock system to I/O, even with the newest NVMe SSDs, or is at least divided between lwlocks and other wait events.


Yes, but some competing products also provide comprehensive waits
monitoring. That makes me think it should be possible for us too.

I agree, but keep in mind that some of those products may use
techniques to reduce the overhead that we don't have available.  I
have a strong suspicion that one of those products in particular has
done something clever to make measuring the time cheap on all
platforms.  Whatever that clever thing is, we haven't done it.  So
that matters.

I don't know about all the products, but Oracle didn't do anything particularly clever - it simply uses gettimeofday on Linux, as strace proves.
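
The cost is also easy to check directly. Here is a minimal standalone sketch (not PostgreSQL code, just a toy micro-benchmark I put together for illustration) that times a loop of clock_gettime()/gettimeofday() calls; on modern Linux both typically go through the vDSO and cost a few tens of nanoseconds per call:

/* Rough micro-benchmark of clock reads on Linux; not PostgreSQL code.
 * Build: cc -O2 -o clockcost clockcost.c
 */
#include <stdio.h>
#include <time.h>
#include <sys/time.h>

#define ITERS 10000000

static double elapsed_ns(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
    struct timespec start, end, ts;
    struct timeval  tv;
    volatile long   sink = 0;
    int             i;

    /* cost of clock_gettime(CLOCK_MONOTONIC) */
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < ITERS; i++)
    {
        clock_gettime(CLOCK_MONOTONIC, &ts);
        sink += ts.tv_nsec;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    printf("clock_gettime: %.1f ns/call\n", elapsed_ns(start, end) / ITERS);

    /* cost of gettimeofday() */
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < ITERS; i++)
    {
        gettimeofday(&tv, NULL);
        sink += tv.tv_usec;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    printf("gettimeofday:  %.1f ns/call\n", elapsed_ns(start, end) / ITERS);

    return sink == -1;  /* keep the compiler from optimizing the loops away */
}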


I think the reason for hooks could be not only disagreements about design
but also platform-dependent issues.
The next step, after we have a view with current wait events, will be gathering
some statistics about them. We can contrast at least two approaches here:
1) Periodic sampling of current wait events.
2) Measuring the duration of each wait event. We could collect statistics
locally for a short period and update a shared memory structure periodically
(using some synchronization protocol).

In the previous attempt to gather lwlock statistics, you predicted that
sampling could have significant overhead [1]. In contrast, on many systems
time measurements are cheap. We have implemented both approaches, and the
results show that sampling every 1 millisecond produces higher overhead than
measuring the duration of each individual wait event. We can share another
version of waits monitoring based on sampling to make these results
reproducible for everybody. However, cheap time measurements are not
available on every platform. For instance, ISTM that on Windows time
measurements are too expensive [2].

That makes me think we need a pluggable solution, at least for
statistics: direct measurement of event durations for the majority of systems,
and sampling for the others as the lesser harm.

To me, those seem like arguments for making it configurable, but not
necessarily for having hooks.

We discussed this with Bruce at pgday.ru, and AFAIK Bruce said that it would be enough to be able to turn timing measurement off with a GUC, so as not to incur significant overhead on some platforms. Isn't that a good approach?
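
To make that concrete, something like the following toy sketch is what we have in mind (standalone C for illustration only; names like track_wait_timing, WaitStats and LOCAL_FLUSH_EVERY are made up, and the real thing would use PostgreSQL's GUC, shared-memory and locking machinery): a flag gates the clock reads entirely, each backend accumulates wait durations in local counters, and only every Nth event are the local counters folded into the shared structure.

/* Toy model of GUC-gated wait timing with local accumulation and periodic
 * flushing to shared memory.  Standalone C for illustration only. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define NUM_WAIT_EVENTS   4        /* pretend wait-event classes */
#define LOCAL_FLUSH_EVERY 128      /* flush local counters every N events */

/* the "GUC": when 0, no clock reads happen at all */
static int track_wait_timing = 1;

typedef struct WaitStats
{
    uint64_t count[NUM_WAIT_EVENTS];
    uint64_t total_ns[NUM_WAIT_EVENTS];
} WaitStats;

static WaitStats shared_stats;          /* stands in for shared memory */
static WaitStats local_stats;           /* backend-local accumulator */
static int       local_events_pending = 0;

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* fold local counters into "shared memory"; in a real backend this is the
 * only place that would need locking or atomics */
static void flush_local_stats(void)
{
    int i;
    for (i = 0; i < NUM_WAIT_EVENTS; i++)
    {
        shared_stats.count[i] += local_stats.count[i];
        shared_stats.total_ns[i] += local_stats.total_ns[i];
    }
    memset(&local_stats, 0, sizeof(local_stats));
    local_events_pending = 0;
}

/* called around every wait; clock reads are skipped when timing is off */
static void report_wait(int event, void (*do_wait)(void))
{
    uint64_t start = track_wait_timing ? now_ns() : 0;

    do_wait();                          /* the actual wait (lock, I/O, ...) */

    if (track_wait_timing)
    {
        local_stats.count[event]++;
        local_stats.total_ns[event] += now_ns() - start;
        if (++local_events_pending >= LOCAL_FLUSH_EVERY)
            flush_local_stats();
    }
}

static void fake_wait(void) { /* pretend to sleep on a lock */ }

int main(void)
{
    int i;
    for (i = 0; i < 100000; i++)
        report_wait(i % NUM_WAIT_EVENTS, fake_wait);
    flush_local_stats();
    printf("event 0: %llu waits, %llu ns total\n",
           (unsigned long long) shared_stats.count[0],
           (unsigned long long) shared_stats.total_ns[0]);
    return 0;
}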

BTW, we have also tested the version with waits monitoring turned off, both with pgbench and on the two workloads mentioned earlier, and we haven't seen any overhead in any of those tests.


I think it's reasonable to consider reporting this data in the PGPROC
using a 4-byte integer rather than reporting it through a single byte
in the backend status structure.  I believe that addresses the
concerns about reporting from auxiliary processes, and it also allows
a little more data to be reported.  For anything in excess of that, I
think we should think rather harder.  Most likely, such additional
detail should be reported only for certain types of wait events, or on
a delay, or something like that, so that the core mechanism remains
really, really fast.
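
Just to illustrate one possible layout for such a 4-byte value (purely a sketch - the split between class and event bits and the class names below are made up): pack the wait-event class into the high byte and the event number into the low 24 bits, so that publishing and reading it is a single aligned 32-bit store/load.

/* Sketch of packing a wait-event class and event id into one uint32,
 * as it might be stored in a 4-byte field in PGPROC.  Layout is
 * illustrative only. */
#include <stdio.h>
#include <stdint.h>

#define WAIT_CLASS_SHIFT   24
#define WAIT_EVENT_MASK    0x00FFFFFF

/* invented class numbers, for illustration */
enum { WAIT_CLASS_NONE = 0, WAIT_CLASS_LWLOCK = 1, WAIT_CLASS_LOCK = 2, WAIT_CLASS_IO = 3 };

static uint32_t make_wait_event(uint8_t classId, uint32_t event)
{
    return ((uint32_t) classId << WAIT_CLASS_SHIFT) | (event & WAIT_EVENT_MASK);
}

static uint8_t wait_event_class(uint32_t w)
{
    return (uint8_t) (w >> WAIT_CLASS_SHIFT);
}

static uint32_t wait_event_id(uint32_t w)
{
    return w & WAIT_EVENT_MASK;
}

int main(void)
{
    /* e.g. "waiting on lwlock number 42" */
    uint32_t w = make_wait_event(WAIT_CLASS_LWLOCK, 42);

    printf("class=%u event=%u raw=0x%08x\n",
           (unsigned) wait_event_class(w),
           (unsigned) wait_event_id(w),
           (unsigned) w);
    return 0;
}

Since an aligned 4-byte store/load should be effectively atomic on the platforms we care about, the reporting path would presumably not need any locking.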

That sounds reasonable. There are many pending questions, but it seems like
a step forward to me.

Great, let's do it.  I think we should probably do the work to
separate the non-individual lwlocks into tranches first, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


--
May the force be with you…
