Thread: Re: [HACKERS] Re: [GSOC 17] Eliminate O(N^2) scaling from rw-conflict tracking in serializable transactions

Hi,  Alvaro and Kevin.

> Anyway, this is just my analysis. 
> So I want to hack the PG and count the conflict lists' size of transactions. That would be more accurate.

In the last week, I hacked PG to add an additional thread that counts the RWConflict list lengths,
and tuned the benchmark to generate more conflicts. But the results are still not good.
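
Roughly, the counting walks each transaction's conflict list with the same SHM_QUEUE pattern that predicate.c already
uses. A minimal sketch of that walk (illustrative only, not the exact code I added; a real version has to hold
SerializableXactHashLock while walking):

/* count the entries in one transaction's outConflicts list */
static int
CountOutConflicts(SERIALIZABLEXACT *sxact)
{
    int         len = 0;
    RWConflict  conflict;

    conflict = (RWConflict)
        SHMQueueNext(&sxact->outConflicts,
                     &sxact->outConflicts,
                     offsetof(RWConflictData, outLink));
    while (conflict)
    {
        len++;
        conflict = (RWConflict)
            SHMQueueNext(&sxact->outConflicts,
                         &conflict->outLink,
                         offsetof(RWConflictData, outLink));
    }
    return len;
}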

> 
> > 
> > Yeah, you need a workload that generates a longer conflict list -- if
> > you can make the tool generate a conflict list with a configurable
> > length, that's even better (say, 100 conflicts vs. 1000 conflicts).
> > Then we can see how the conflict list processing scales.
> > 
> 
> Yes, I tried to increase the read set to make more conflicts. 
> However the abort ratio will also increase. The CPU cycles consumed by conflict tracking are still less than 1%.
> According to the design of PG, a transaction will be aborted if there is a rw-antidependency. 
> In this case, a transaction with a longer conflict list, is more possible to be aborted.
> That means, the conflict list doesn't have too many chances to grow too long. 
> 

To solve this problem, I used just two kinds of transactions: read-only transactions and update-only transactions.
In this case, no transaction can have both an in-RWConflict and an out-RWConflict at the same time,
so no transaction is ever aborted by the conflict checking.

Specifically, the benchmark is as follows:
the table has 10K rows; read-only transactions read 1K rows, and update-only transactions update 20 random rows of the table.
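
For illustration, the update-only transaction looks roughly like this as a libpq client (just a sketch: the table and
column names bench(id, val) and the connection string are made up for the example; the read-only transaction is the
analogous SELECT of 1K rows):

#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=bench");
    PGresult   *res;
    char        sql[128];

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    res = PQexec(conn, "BEGIN ISOLATION LEVEL SERIALIZABLE");
    PQclear(res);

    /* update 20 random rows of the 10K-row table */
    for (int i = 0; i < 20; i++)
    {
        snprintf(sql, sizeof(sql),
                 "UPDATE bench SET val = val + 1 WHERE id = %d",
                 rand() % 10000);
        res = PQexec(conn, sql);
        PQclear(res);
    }

    /* can still fail if two concurrent updates hit the same row */
    res = PQexec(conn, "COMMIT");
    PQclear(res);
    PQfinish(conn);
    return 0;
}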

In this benchmark, about 91% of the conflict lists are shorter than 10 entries;
6% are between 10 and 20, and only 2% are longer than 20. The CPU utilization of
CheckForSerializableConflictOut/In is 0.71%/0.69%.
 

I tried to increase the write set. As a result, the conflict lists become longer, but the total CPU utilization drops
(to about 50%).
 
The CPU is no longer the bottleneck. I'm not familiar with the other parts of PG. Is it caused by locking? Is there any
chance to get rid of this problem?
 

BTW, I find that email is not very convenient, especially when I have a problem and want to discuss it with you.
Would you mind scheduling a weekly meeting over Skype, or any other instant-messaging software you like?
It would not take too much of your time. Maybe one hour is enough.


Sincerely.
--
Mengxing Liu











Mengxing Liu wrote:
> Hi,  Alvaro and Kevin.
> 
> > Anyway, this is just my analysis. 
> > So I want to hack the PG and count the conflict lists' size of transactions. That would be more accurate.
> 
> In the last week, I hacked the PG to add an additional thread to count RWConflict list lengths. 
> And tune the benchmark to get more conflict. But the result is still not good.

Kevin mentioned during PGCon that there's a paper by some group in
Sydney that developed a benchmark on which this scalability problem
showed up very prominently.  I think your first step should be to
reproduce their results -- my recollection is that Kevin says you
already know that paper, so please dedicate some time to analyze it and
reproduce their workload.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



> Mengxing Liu wrote:

>> The CPU utilization of CheckForSerializableConflictOut/In is
>> 0.71%/0.69%.

How many cores were on the system used for this test?  The paper
specifically said that they didn't see performance degradation on
the PostgreSQL implementation until 32 concurrent connections with
24 or more cores.  The goal here is to fix *scaling* problems.  Be
sure you are testing at an appropriate scale.  (The more sockets,
cores, and clients, the better, I think.)


On Fri, Jun 2, 2017 at 10:08 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

> Kevin mentioned during PGCon that there's a paper by some group in
> Sydney that developed a benchmark on which this scalability
> problem showed up very prominently.

https://pdfs.semanticscholar.org/6c4a/e427e6f392b7dec782b7a60516f0283af1f5.pdf

"[...] we see a much better scalability of pgSSI than the
corresponding implementations on InnoDB. Still, the overhead of
pgSSI versus standard SI increases significantly with higher MPL
than one would normally expect, reaching around 50% with MPL 128."

"Our profiling showed that PostgreSQL spend 2.3% of the overall
runtime in traversing these list, plus 10% of its runtime waiting on
the corresponding kernel mutexes."

If you cannot duplicate their results, you might want to contact the
authors for more details on their testing methodology.

Note that the locking around access to the list is likely to be a
bigger problem than the actual walking and manipulation of the list
itself.

--
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/




> -----Original Messages-----
> From: "Kevin Grittner" <kgrittn@gmail.com>
> Sent Time: 2017-06-03 01:44:16 (Saturday)
> To: "Alvaro Herrera" <alvherre@2ndquadrant.com>
> Cc: "Mengxing Liu" <liu-mx15@mails.tsinghua.edu.cn>, "pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>
> Subject: Re: Re: Re: [HACKERS] Re: [GSOC 17] Eliminate O(N^2) scaling from rw-conflict tracking in serializable transactions
> 
> > Mengxing Liu wrote:
> 
> >> The CPU utilization of CheckForSerializableConflictOut/In is
> >> 0.71%/0.69%.
> 
> How many cores were on the system used for this test?  The paper
> specifically said that they didn't see performance degradation on
> the PostgreSQL implementation until 32 concurrent connections with
> 24 or more cores.  The goal here is to fix *scaling* problems.  Be
> sure you are testing at an appropriate scale.  (The more sockets,
> cores, and clients, the better, I think.)
> 
> 

I used 15 cores for the server and 50 clients.
I tried 30 cores, but the CPU utilization was only about 45%~70%.
How can we tell where the problem is? Is it disk I/O or locking?

> On Fri, Jun 2, 2017 at 10:08 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> 
> > Kevin mentioned during PGCon that there's a paper by some group in
> > Sydney that developed a benchmark on which this scalability
> > problem showed up very prominently.
> 
> https://pdfs.semanticscholar.org/6c4a/e427e6f392b7dec782b7a60516f0283af1f5.pdf
> 
> "[...] we see a much better scalability of pgSSI than the
> corresponding implementations on InnoDB. Still, the overhead of
> pgSSI versus standard SI increases significantly with higher MPL
> than one would normally expect, reaching around 50% with MPL 128."
> 

Actually, I implemented the benchmark described in the paper first and reported the result in a previous email.
The problem is that a transaction with a longer conflict list is more likely to be aborted, so the conflict lists cannot
grow very long.

I modified the benchmark to use only update-only transactions and read-only transactions to get rid of this problem. In
this way, the dangerous structure is never generated.
 

> "Our profiling showed that PostgreSQL spend 2.3% of the overall
> runtime in traversing these list, plus 10% of its runtime waiting on
> the corresponding kernel mutexes."
> 

Does "traversing these list" means the function "RWConflictExists"? 
And "10% waiting on the corresponding kernel mutexes" means the lock in the function CheckForSerializableIn/out ?

> If you cannot duplicate their results, you might want to contact the
> authors for more details on their testing methodology.
> 

With 30 cores for the server and 90 clients, RWConflictExists consumes 1.0% of CPU cycles, and
CheckForSerializableConflictIn/Out takes 5% in total.

But the total CPU utilization of PG is only about 50%, so scaled up, the numbers seem roughly consistent with the paper's.
If we can solve this problem, we can use this benchmark for future work.

Sincerely

--
Mengxing Liu











On Sat, Jun 3, 2017 at 1:51 AM, Mengxing Liu
<liu-mx15@mails.tsinghua.edu.cn> wrote:

> I tried 30 cores. But the CPU utilization is about 45%~70%.
> How can we distinguish  where the problem is? Is disk I/O or Lock?

A simple way is to run `vmstat 1` for a bit during the test.  Can
you post a portion of the output of that here?  If you can configure
the WAL directory to a separate mount point (e.g., use the --waldir
option of initdb), a snippet of `iostat 1` output would be even
better.

I think the best thing may be if you can generate a CPU flame graph
of the worst case you can make for these lists:
http://www.brendangregg.com/flamegraphs.html  IMO, such a graph
highlights the nature of the problem better than anything else.

> Does "traversing these list" means the function "RWConflictExists"?
> And "10% waiting on the corresponding kernel mutexes" means the
> lock in the function CheckForSerializableIn/out ?

Since they seemed to be talking specifically about the conflict
list, I had read that as SerializableXactHashLock, although the
wording is a bit vague -- they may mean something more inclusive.

> If I used 30 cores for server, and 90 clients, RWConflictExists
> consumes 1.0% CPU cycles, and CheckForSerializableIn/out takes 5%
> in all.
> But the total CPU utilization of PG is about 50%. So the result
> seems to be matched.
> If we can solve this problem, we can use this benchmark for the
> future work.

If you can get a flame graph of CPU usage on this load, that would
be ideal.  At that point, we can discuss what is best to attack.
Reducing something that is 10% of the total PostgreSQL CPU load in a
particular workload sounds like it could still have significant
value, although if you see a way to attack the other 90% (or some
portion of it larger than 10%) instead, I think we could adjust the
scope based on those results.

--
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/





> -----Original Messages-----
> From: "Kevin Grittner" <kgrittn@gmail.com>

> > I tried 30 cores. But the CPU utilization is about 45%~70%.
> > How can we distinguish  where the problem is? Is disk I/O or Lock?
> 
> A simple way is to run `vmstat 1` for a bit during the test.  Can
> you post a portion of the output of that here?  If you can configure
> the WAL directory to a separate mount point (e.g., use the --waldir
> option of initdb), a snippet of `iostat 1` output would be even
> better.

"vmstat 1" output is as follow. Because I used only 30 cores (1/4 of all),  cpu user time should be about 12*4 = 48. 
There seems to be no process blocked by IO. 

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
28  0      0 981177024 315036 70843760    0    0     0     9    0    0  1  0 99  0  0
21  1      0 981178176 315036 70843784    0    0     0     0 25482 329020 12  3 85  0  0
18  1      0 981179200 315036 70843792    0    0     0     0 26569 323596 12  3 85  0  0
17  0      0 981175424 315036 70843808    0    0     0     0 25374 322992 12  4 85  0  0
12  0      0 981174208 315036 70843824    0    0     0     0 24775 321577 12  3 85  0  0
 8  0      0 981179328 315036 70845336    0    0     0     0 13115 199020  6  2 92  0  0
13  0      0 981179200 315036 70845792    0    0     0     0 22893 301373 11  3 87  0  0
11  0      0 981179712 315036 70845808    0    0     0     0 26933 325728 12  4 84  0  0
30  0      0 981178304 315036 70845824    0    0     0     0 23691 315821 11  4 85  0  0
12  1      0 981177600 315036 70845832    0    0     0     0 29485 320166 12  4 84  0  0
32  0      0 981180032 315036 70845848    0    0     0     0 25946 316724 12  4 84  0  0
21  0      0 981176384 315036 70845864    0    0     0     0 24227 321938 12  4 84  0  0
21  0      0 981178880 315036 70845880    0    0     0     0 25174 326943 13  4 83  0  0

I used a ramdisk to speed up disk I/O, so iostat cannot give useful information.

> I think the best thing may be if you can generate a CPU flame graph
> of the worst case you can make for these lists:
> http://www.brendangregg.com/flamegraphs.html  IMO, such a graph
> highlights the nature of the problem better than anything else.
> 

The flame graph is attached. I used 'perf' to generate it; only the CPUs running the PG server were profiled.
I'm not familiar with the other parts of PG. Can you see anything unusual in the graph?


--
Mengxing Liu








Hi, Kevin and Alvaro. 

I think disk I/O is not the bottleneck in our experiment, but the global lock is. 

For disk I/O, there are two pieces of evidence:

1) The total throughput is no more than 10 Ktps, and only half of the transactions are updates. An update transaction
modifies 20 tuples, and each tuple is about 100 B, so the data written to disk should be less than 10 MB/s. Even if we
take the write-ahead log into account (just double it), the data should be less than 20 MB/s.

I also replaced the ramdisk with an SSD, and "iostat" shows the same result, while our SSD's sequential write speed is
more than 700 MB/s.
 
2) I changed the isolation level from "serializable" to "read committed". As the isolation requirement becomes looser,
throughput increases; but in that case the CPU utilization is nearly 100% (it is about 50% in serializable mode).
 
Therefore, disk I/O is not the bottleneck.

For the lock:
I read the source code in predicate.c and found that many functions take a global lock, SerializableXactHashLock, so
only one process can be working in that code at any time!

As the problem of the CPU not being fully used appeared only after I changed the isolation level to "serializable",
this global lock should be the bottleneck.
 
Unfortunately, "perf" seems unable to record time waiting for locks.
I did it by hand.  Specifically, I use function "gettimeofday" just before acquiring locks and after releasing locks. 
In this way, I found function CheckForSerializableIn/CheckForSerializableOut takes more than 10% of running time, which
isfar bigger than what reported by perf in the last email.
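
The measurement is essentially this pattern (a simplified sketch, not the actual instrumentation; how the accumulator
gets reported is left out, and the real code needs postgres.h and storage/lwlock.h):

#include <sys/time.h>

static uint64 sxact_lock_us = 0;    /* accumulated acquire-to-release time */

static void
timed_sxact_section(void)
{
    struct timeval t0, t1;

    gettimeofday(&t0, NULL);
    LWLockAcquire(SerializableXactHashLock, LW_EXCLUSIVE);

    /* ... the existing code guarded by the lock goes here ... */

    LWLockRelease(SerializableXactHashLock);
    gettimeofday(&t1, NULL);

    sxact_lock_us += (uint64) (t1.tv_sec - t0.tv_sec) * 1000000 +
                     (t1.tv_usec - t0.tv_usec);
}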
 

If my analysis is right, that sounds like good news, because we have found where the problem is.
Kevin has mentioned that the lock is used to protect the linked list. So in the next step I want to replace the linked
list with a hash table (a rough sketch follows below), and after that I will try to remove this lock carefully.
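
A rough sketch of that direction (not a patch: the key layout and sizing below are just assumptions, and the real table
would have to live in shared memory via ShmemInitHash with its own locking, which is exactly the part that needs care):

#include "postgres.h"
#include "storage/predicate_internals.h"
#include "utils/hsearch.h"

/* one entry per (writer, reader) conflict pair */
typedef struct RWConflictKey
{
    SERIALIZABLEXACT *sxactOut;     /* writer */
    SERIALIZABLEXACT *sxactIn;      /* reader */
} RWConflictKey;

static HTAB *RWConflictHash;

static void
InitRWConflictHash(void)
{
    HASHCTL     info;

    memset(&info, 0, sizeof(info));
    info.keysize = sizeof(RWConflictKey);
    info.entrysize = sizeof(RWConflictKey);

    RWConflictHash = hash_create("RWConflict hash", 1024, &info,
                                 HASH_ELEM | HASH_BLOBS);
}

/* O(1) lookup instead of walking the conflict list in RWConflictExists() */
static bool
RWConflictExistsHash(SERIALIZABLEXACT *reader, SERIALIZABLEXACT *writer)
{
    RWConflictKey key;
    bool        found;

    key.sxactOut = writer;
    key.sxactIn = reader;
    (void) hash_search(RWConflictHash, &key, HASH_FIND, &found);
    return found;
}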
 
But in this way, our purpose has changed: the O(N^2) tracking is not the bottleneck, the global lock is.

By the way, using "gettimeofday" to profile like this is really ugly.
"perf lock" can only record kernel mutexes and requires kernel support, so I didn't use it.
Do you have any good ideas for profiling time spent waiting for locks?


> -----Original Messages-----
> From: "Mengxing Liu" <liu-mx15@mails.tsinghua.edu.cn>
> Sent Time: 2017-06-05 00:27:51 (Monday)
> To: "Kevin Grittner" <kgrittn@gmail.com>
> Cc: "Alvaro Herrera" <alvherre@2ndquadrant.com>, "pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>
> Subject: Re: Re: [HACKERS] Re: [GSOC 17] Eliminate O(N^2) scaling from rw-conflict tracking in serializable transactions
> 
> 
> 
> 
> > -----Original Messages-----
> > From: "Kevin Grittner" <kgrittn@gmail.com>
> 
> > > I tried 30 cores. But the CPU utilization is about 45%~70%.
> > > How can we distinguish  where the problem is? Is disk I/O or Lock?
> > 
> > A simple way is to run `vmstat 1` for a bit during the test.  Can
> > you post a portion of the output of that here?  If you can configure
> > the WAL directory to a separate mount point (e.g., use the --waldir
> > option of initdb), a snippet of `iostat 1` output would be even
> > better.
> 
> "vmstat 1" output is as follow. Because I used only 30 cores (1/4 of all),  cpu user time should be about 12*4 = 48.

> There seems to be no process blocked by IO. 
> 
> procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
> 28  0      0 981177024 315036 70843760    0    0     0     9    0    0  1  0 99  0  0
> 21  1      0 981178176 315036 70843784    0    0     0     0 25482 329020 12  3 85  0  0
> 18  1      0 981179200 315036 70843792    0    0     0     0 26569 323596 12  3 85  0  0
> 17  0      0 981175424 315036 70843808    0    0     0     0 25374 322992 12  4 85  0  0
> 12  0      0 981174208 315036 70843824    0    0     0     0 24775 321577 12  3 85  0  0
>  8  0      0 981179328 315036 70845336    0    0     0     0 13115 199020  6  2 92  0  0
> 13  0      0 981179200 315036 70845792    0    0     0     0 22893 301373 11  3 87  0  0
> 11  0      0 981179712 315036 70845808    0    0     0     0 26933 325728 12  4 84  0  0
> 30  0      0 981178304 315036 70845824    0    0     0     0 23691 315821 11  4 85  0  0
> 12  1      0 981177600 315036 70845832    0    0     0     0 29485 320166 12  4 84  0  0
> 32  0      0 981180032 315036 70845848    0    0     0     0 25946 316724 12  4 84  0  0
> 21  0      0 981176384 315036 70845864    0    0     0     0 24227 321938 12  4 84  0  0
> 21  0      0 981178880 315036 70845880    0    0     0     0 25174 326943 13  4 83  0  0
> 
> I used ramdisk to speedup the disk IO. Therefore, iostat can not give useful information. 
> 
> > I think the best thing may be if you can generate a CPU flame graph
> > of the worst case you can make for these lists:
> > http://www.brendangregg.com/flamegraphs.html  IMO, such a graph
> > highlights the nature of the problem better than anything else.
> > 
> 
> The flame graph is attached. I use 'perf' to generate the flame graph. Only the CPUs running PG server are profiled.

> I'm not familiar with other part of PG. Can you find anything unusual in the graph?
> 
> 
> --
> Mengxing Liu
> 
> 
> 
> 
> 
> 
> 


--
Mengxing Liu











On Tue, Jun 6, 2017 at 12:16 PM, Mengxing Liu
<liu-mx15@mails.tsinghua.edu.cn> wrote:
> I think disk I/O is not the bottleneck in our experiment, but the global lock is.

A handy way to figure this kind of thing out is to run a query like
this repeatedly during the benchmark:

SELECT wait_event_type, wait_event FROM pg_stat_activity;

I often do this by using psql's \watch command, often \watch 0.5 to
run it every half-second.  I save all the results collected during the
benchmark using 'script' and then analyze them to see which wait
events are most frequent.  If your theory is right, you ought to see
that SerializableXactHashLock occurs as a wait event very frequently.
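
(If a non-interactive collector is easier than psql plus script, a small libpq loop like the following would gather the
same samples; this is only a sketch, with the connection string and sample count made up.  Pipe the output through
sort | uniq -c afterwards to rank the wait events.)

#include <stdio.h>
#include <unistd.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=bench");

    for (int i = 0; i < 120; i++)       /* about a minute at 0.5 s intervals */
    {
        PGresult   *res = PQexec(conn,
            "SELECT wait_event_type, wait_event FROM pg_stat_activity "
            "WHERE wait_event IS NOT NULL");

        for (int row = 0; row < PQntuples(res); row++)
            printf("%s\t%s\n",
                   PQgetvalue(res, row, 0), PQgetvalue(res, row, 1));
        PQclear(res);
        usleep(500000);                 /* matches \watch 0.5 */
    }
    PQfinish(conn);
    return 0;
}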

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Sun, Jun 4, 2017 at 11:27 AM, Mengxing Liu
<liu-mx15@mails.tsinghua.edu.cn> wrote:

> "vmstat 1" output is as follow. Because I used only 30 cores (1/4 of all),  cpu user time should be about 12*4 = 48.
> There seems to be no process blocked by IO.
>
> procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
> 28  0      0 981177024 315036 70843760    0    0     0     9    0    0  1  0 99  0  0
> 21  1      0 981178176 315036 70843784    0    0     0     0 25482 329020 12  3 85  0  0
> 18  1      0 981179200 315036 70843792    0    0     0     0 26569 323596 12  3 85  0  0
> 17  0      0 981175424 315036 70843808    0    0     0     0 25374 322992 12  4 85  0  0
> 12  0      0 981174208 315036 70843824    0    0     0     0 24775 321577 12  3 85  0  0
>  8  0      0 981179328 315036 70845336    0    0     0     0 13115 199020  6  2 92  0  0
> 13  0      0 981179200 315036 70845792    0    0     0     0 22893 301373 11  3 87  0  0
> 11  0      0 981179712 315036 70845808    0    0     0     0 26933 325728 12  4 84  0  0
> 30  0      0 981178304 315036 70845824    0    0     0     0 23691 315821 11  4 85  0  0
> 12  1      0 981177600 315036 70845832    0    0     0     0 29485 320166 12  4 84  0  0
> 32  0      0 981180032 315036 70845848    0    0     0     0 25946 316724 12  4 84  0  0
> 21  0      0 981176384 315036 70845864    0    0     0     0 24227 321938 12  4 84  0  0
> 21  0      0 981178880 315036 70845880    0    0     0     0 25174 326943 13  4 83  0  0

This machine has 120 cores?  Is hyperthreading enabled?  If so, what
you are showing might represent a total saturation of the 30 cores.
Context switches of about 300,000 per second is pretty high.  I can't
think of when I've seen that except when there is high spinlock
contention.

Just to put the above in context, how did you limit the test to 30
cores?  How many connections were open during the test?

> The flame graph is attached. I use 'perf' to generate the flame graph. Only the CPUs running PG server are profiled.
> I'm not familiar with other part of PG. Can you find anything unusual in the graph?

Two SSI functions stand out:
10.86% PredicateLockTuple
 3.51% CheckForSerializableConflictIn

In both cases, most of that seems to go to lightweight locking.  Since
you said this is a CPU graph, that again suggests spinlock contention
issues.

-- 
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/




> From: "Kevin Grittner" <kgrittn@gmail.com>
> <liu-mx15@mails.tsinghua.edu.cn> wrote:
> 
> > "vmstat 1" output is as follow. Because I used only 30 cores (1/4 of all),  cpu user time should be about 12*4 =
48.
> > There seems to be no process blocked by IO.
> >
> > procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
> >  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
> > 28  0      0 981177024 315036 70843760    0    0     0     9    0    0  1  0 99  0  0
> > 21  1      0 981178176 315036 70843784    0    0     0     0 25482 329020 12  3 85  0  0
> > 18  1      0 981179200 315036 70843792    0    0     0     0 26569 323596 12  3 85  0  0
> > 17  0      0 981175424 315036 70843808    0    0     0     0 25374 322992 12  4 85  0  0
> > 12  0      0 981174208 315036 70843824    0    0     0     0 24775 321577 12  3 85  0  0
> >  8  0      0 981179328 315036 70845336    0    0     0     0 13115 199020  6  2 92  0  0
> > 13  0      0 981179200 315036 70845792    0    0     0     0 22893 301373 11  3 87  0  0
> > 11  0      0 981179712 315036 70845808    0    0     0     0 26933 325728 12  4 84  0  0
> > 30  0      0 981178304 315036 70845824    0    0     0     0 23691 315821 11  4 85  0  0
> > 12  1      0 981177600 315036 70845832    0    0     0     0 29485 320166 12  4 84  0  0
> > 32  0      0 981180032 315036 70845848    0    0     0     0 25946 316724 12  4 84  0  0
> > 21  0      0 981176384 315036 70845864    0    0     0     0 24227 321938 12  4 84  0  0
> > 21  0      0 981178880 315036 70845880    0    0     0     0 25174 326943 13  4 83  0  0
> 
> This machine has 120 cores?  Is hyperthreading enabled?  If so, what
> you are showing might represent a total saturation of the 30 cores.
> Context switches of about 300,000 per second is pretty high.  I can't
> think of when I've seen that except when there is high spinlock
> contention.
> 

Yes, and hyper-threading is disabled.

> Just to put the above in context, how did you limit the test to 30
> cores?  How many connections were open during the test?
> 

I used numactl to restrict the test to the first two sockets (15 cores in each socket),
and there were 90 concurrent connections.

> > The flame graph is attached. I use 'perf' to generate the flame graph. Only the CPUs running PG server are
profiled.
> > I'm not familiar with other part of PG. Can you find anything unusual in the graph?
> 
> Two SSI functions stand out:
> 10.86% PredicateLockTuple
>  3.51% CheckForSerializableConflictIn
> 
> In both cases, most of that seems to go to lightweight locking.  Since
> you said this is a CPU graph, that again suggests spinlock contention
> issues.
> 
> -- 

Yes. Are there other kinds of locks besides spinlocks? I'm reading the locking code in PG now. If all the locks were
spinlocks, the CPU should be 100% busy; but only 50% of the CPU is used.

I'm afraid there is extra time spent waiting on mutexes or semaphores.
These SSI functions will cost more time than reported, because perf doesn't record the time spent sleeping while
waiting for locks.

CheckForSerializableConflictIn takes 10% of the running time (see my last email).

--
Mengxing Liu











Thank you very much! I followed your advice, and here are the wait event counts:

SerializableXactHashLock 73
predicate_lock_manager 605
WALWriteLock 3
SerializableFinishedListLock 665

There were more than 90 wait events in each sample. SerializableXactHashLock and SerializableFinishedListLock are both
used in SSI. I think that's why PG is so slow in this high-contention environment.


> -----Original Messages-----
> From: "Robert Haas" <robertmhaas@gmail.com>
> Sent Time: 2017-06-08 01:30:58 (Thursday)
> To: "Mengxing Liu" <liu-mx15@mails.tsinghua.edu.cn>
> Cc: kgrittn <kgrittn@gmail.com>, "Alvaro Herrera" <alvherre@2ndquadrant.com>, "pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>
> Subject: Re: [HACKERS] Re: [GSOC 17] Eliminate O(N^2) scaling from rw-conflict tracking in serializable transactions
> 
> On Tue, Jun 6, 2017 at 12:16 PM, Mengxing Liu
> <liu-mx15@mails.tsinghua.edu.cn> wrote:
> > I think disk I/O is not the bottleneck in our experiment, but the global lock is.
> 
> A handy way to figure this kind of thing out is to run a query like
> this repeatedly during the benchmark:
> 
> SELECT wait_event_type, wait_event FROM pg_stat_activity;
> 
> I often do this by using psql's \watch command, often \watch 0.5 to
> run it every half-second.  I save all the results collected during the
> benchmark using 'script' and then analyze them to see which wait
> events are most frequent.  If your theory is right, you ought to see
> that SerializableXactHashLock occurs as a wait event very frequently.
> 
> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


--
Mengxing Liu