Thread: Parallel Select query performance and shared buffers

Parallel Select query performance and shared buffers

From: Metin Doslu

We have several independent tables on a multi-core machine serving Select queries. These tables fit into memory, and each Select query goes over one table's pages sequentially. In this experiment, there are no indexes or table joins.

When we send concurrent Select queries to these tables, query performance doesn't scale out with the number of CPU cores. We find that complex Select queries scale out better than simpler ones. We also find that increasing the block size from 8 KB to 32 KB, or increasing shared_buffers to include the working set, mitigates the problem to some extent.

For our experiments, we chose an 8-core machine with 68 GB of memory from Amazon's EC2 service. We installed PostgreSQL 9.3.1 on the instance, and set shared_buffers to 4 GB.

We then generated 1, 2, 4, and 8 separate tables using the data generator from the industry-standard TPC-H benchmark. Each table we generated, called lineitem-1, lineitem-2, etc., had about 750 MB of data. Next, we sent 1, 2, 4, and 8 concurrent Select queries to these tables to observe the scale-out behavior. Our expectation was that since this machine had 8 cores, our run times would stay roughly constant throughout. We also expected the machine's CPU utilization to go up to 100% at 8 concurrent queries. Neither of those assumptions held true.
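
For concreteness, our test driver was roughly shaped like the minimal libpq sketch below (a sketch under assumptions, not our exact harness; the connection string and client count are placeholders):

/* Fire N concurrent sequential-scan queries, one per table, and wait. */
#include <libpq-fe.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
    int nclients = 8;                       /* 1, 2, 4 or 8 in our runs */

    for (int i = 1; i <= nclients; i++)
    {
        if (fork() == 0)
        {
            char    query[128];
            PGconn *conn = PQconnectdb("dbname=postgres");

            if (PQstatus(conn) != CONNECTION_OK)
                exit(1);

            snprintf(query, sizeof(query),
                     "SELECT count(*) FROM lineitem_%d;", i);
            PQclear(PQexec(conn, query));   /* one table per client */
            PQfinish(conn);
            exit(0);
        }
    }

    for (int i = 0; i < nclients; i++)
        wait(NULL);                         /* elapsed time = slowest client */
    return 0;
}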

We found that query run times degraded as we increased the number of concurrent Select queries. Also, CPU utilization flattened out at less than 50% for the simpler queries. Full results with a block size of 8 KB are below:

                         Table select count(*)    TPC-H Simple (#6)[2]    TPC-H Complex (#1)[1]
1 Table  / 1 query               1.5 s                    2.5 s                  8.4 s
2 Tables / 2 queries             1.5 s                    2.5 s                  8.4 s
4 Tables / 4 queries             2.0 s                    2.9 s                  8.8 s
8 Tables / 8 queries             3.3 s                    4.0 s                  9.6 s

We then increased the block size (BLCKSZ) from 8 KB to 32 KB and recompiled PostgreSQL. This change had a positive impact on query completion times. Here are the new results with a block size of 32 KB:

                         Table select count(*)    TPC-H Simple (#6)[2]    TPC-H Complex (#1)[1]
1 Table  / 1 query               1.5 s                    2.3 s                  8.0 s
2 Tables / 2 queries             1.5 s                    2.3 s                  8.0 s
4 Tables / 4 queries             1.6 s                    2.4 s                  8.1 s
8 Tables / 8 queries             1.8 s                    2.7 s                  8.3 s

As a quick aside, we also repeated the same experiment on an EC2 instance with 16 CPU cores, and found that the scale-out behavior became worse there. (We also tried increasing shared_buffers to 30 GB. This change completely solved the scale-out problem on this instance type, but hurt our performance on the hi1.4xlarge instances.)

Unfortunately, increasing the block size from 8 to 32 KB has other implications for some of our customers. Could you help us out with the problem here?

What can we do to identify the problem's root cause? Can we work around it?

Thank you,
Metin

Re: Parallel Select query performance and shared buffers

From: Amit Kapila

On Tue, Dec 3, 2013 at 7:11 PM, Metin Doslu <metin@citusdata.com> wrote:
> We have several independent tables on a multi-core machine serving Select
> queries. These tables fit into memory; and each Select queries goes over one
> table's pages sequentially. In this experiment, there are no indexes or
> table joins.
>
> When we send concurrent Select queries to these tables, query performance
> doesn't scale out with the number of CPU cores. We find that complex Select
> queries scale out better than simpler ones. We also find that increasing the
> block size from 8 KB to 32 KB, or increasing shared_buffers to include the
> working set mitigates the problem to some extent.
>
> For our experiments, we chose an 8-core machine with 68 GB of memory from
> Amazon's EC2 service. We installed PostgreSQL 9.3.1 on the instance, and set
> shared_buffers to 4 GB.
>
> We then generated 1, 2, 4, and 8 separate tables using the data generator
> from the industry standard TPC-H benchmark. Each table we generated, called
> lineitem-1, lineitem-2, etc., had about 750 MB of data.

I think all of this data cannot fit in shared_buffers; you might want
to increase shared_buffers to a larger size (not 30 GB, but close to
your data size) to see how it behaves.


> Next, we sent 1, 2,
> 4, and 8 concurrent Select queries to these tables to observe the scale out
> behavior. Our expectation was that since this machine had 8 cores, our run
> times would stay constant all throughout. Also, we would have expected the
> machine's CPU utilization to go up to 100% at 8 concurrent queries. Neither
> of those assumptions held true.

Your queries have aggregation and ORDER/GROUP BY, so there is a chance
that I/O can happen for those operations
if PG doesn't have sufficient memory (work_mem) to perform them.

> As a quick side, we also repeated the same experiment on an EC2 instance
> with 16 CPU cores, and found that the scale out behavior became worse there.
> (We also tried increasing the shared_buffers to 30 GB. This change
> completely solved the scaling out problem on this instance type, but hurt
> our performance on the hi1.4xlarge instances.)

Instead of 30 GB, you can try a smaller value, but it should be close
to your data size.

> Unfortunately, increasing the block size from 8 to 32 KB has other
> implications for some of our customers. Could you help us out with the
> problem here?
>
> What can we do to identify the problem's root cause? Can we work around it?

I think without finding the real cause, it will be difficult to come up
with a reasonable workaround.
Can you simplify your queries (a simple scan, in other words no
aggregation or other operations) to see how they behave in your
environment? Once you are able to see the simple queries scaling as you
expect, you can try the complex ones.

Note - post this on pgsql-performance as well.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel Select query performance and shared buffers

From: Claudio Freire

On Wed, Dec 4, 2013 at 12:57 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> As a quick side, we also repeated the same experiment on an EC2 instance
>> with 16 CPU cores, and found that the scale out behavior became worse there.
>> (We also tried increasing the shared_buffers to 30 GB. This change
>> completely solved the scaling out problem on this instance type, but hurt
>> our performance on the hi1.4xlarge instances.)
>
> Instead of 30GB, you can try with lesser value, but it should be close
> to your data size.

The OS cache should have provided a similar function.

In fact, larger shared buffers shouldn't have made a difference if the
main I/O pattern is sequential scans, because they use a ring buffer.
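
(Roughly, the idea behind that strategy is the following - a simplified
sketch with made-up names, not the actual PostgreSQL source:)

/* A bulk sequential scan allocates a small, private ring of buffers and
 * keeps recycling them, so the pages it reads never displace the rest
 * of shared_buffers. */
#define RING_SIZE 32                    /* on the order of 256 kB */

typedef struct RingStrategySketch
{
    int buffers[RING_SIZE];             /* buffer slots owned by this scan */
    int current;                        /* next slot to reuse */
} RingStrategySketch;

static int
ring_next_buffer(RingStrategySketch *ring)
{
    int buf = ring->buffers[ring->current];

    ring->current = (ring->current + 1) % RING_SIZE;
    return buf;                         /* reuse instead of evicting others */
}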

Can we have the explain analyze of those queries, the postgres
configuration, and perhaps vmstat output during execution?



Re: Parallel Select query performance and shared buffers

From: Metin Doslu

>   I think all of this data cannot fit in shared_buffers, you might want to increase shared_buffers
>   to larger size (not 30GB but close to your data size) to see how it behaves.

When I use a shared_buffers setting larger than my data size, such as 10 GB, results scale nearly as expected, at least for this instance type.

> You queries have Aggregation, ORDER/GROUP BY, so there is a chance
> that I/O can happen for those operation's
> if PG doesn't have sufficient memory (work_mem) to perform such operation.

I set work_mem to 32 MB, which should be enough for these queries. I also tested with higher values of work_mem and didn't observe any difference.

> Can you simplify your queries (simple scan or in other words no
> aggregation or other things) to see how
> they behave in your env., once you are able to see simple queries
> scaling as per your expectation, you
> can try with complex one's.

Actually, we observe the problem when the queries get simpler, such as select count(*). Here is the results table in a more compact format:

                  select count(*) TPC-H Simple(#6) TPC-H Complex(#1)
1 Table / 1 query      1.5 s            2.5 s           8.4 s
2 Tables/ 2 queries    1.5 s            2.5 s           8.4 s
4 Tables/ 4 queries    2.0 s            2.9 s           8.8 s
8 Tables/ 8 queries    3.3 s            4.0 s           9.6 s

> Can we have the explain analyze of those queries, postgres
> configuration, perhaps vmstat output during execution?

postgres=# explain analyze SELECT count(*) from lineitem_1;
                                                          QUERY PLAN                                                          
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=199645.01..199645.02 rows=1 width=0) (actual time=11317.391..11317.393 rows=1 loops=1)
   ->  Seq Scan on lineitem_1  (cost=0.00..184641.81 rows=6001281 width=0) (actual time=0.011..5805.255 rows=6001215 loops=1)
 Total runtime: 11317.440 ms
(3 rows)

postgres=# explain analyze SELECT
postgres-#     sum(l_extendedprice * l_discount) as revenue
postgres-# FROM
postgres-#     lineitem_1
postgres-# WHERE
postgres-#     l_shipdate >= date '1994-01-01'
postgres-#     AND l_shipdate < date '1994-01-01' + interval '1' year
postgres-#     AND l_discount between 0.06 - 0.01 AND 0.06 + 0.01
postgres-#     AND l_quantity < 24;
                                                 QUERY PLAN                                                                           
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=260215.36..260215.37 rows=1 width=16) (actual time=1751.775..1751.776 rows=1 loops=1)
   ->  Seq Scan on lineitem_1  (cost=0.00..259657.82 rows=111508 width=16) (actual time=0.031..1630.449 rows=114160 loops=1)
         Filter: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.05::double precision) AND (l_discount <= 0.07::double precision) AND (l_quantity < 24::double precision))
         Rows Removed by Filter: 5887055
 Total runtime: 1751.830 ms
(5 rows)

postgres=# explain analyze SELECT
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
FROM
    lineitem_1
WHERE
    l_shipdate <= date '1998-12-01' - interval '90' day
GROUP BY
    l_returnflag,
    l_linestatus
ORDER BY
    l_returnflag,
    l_linestatus;
                                                             QUERY PLAN                                                              
-------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=436342.68..436342.69 rows=6 width=36) (actual time=18720.932..18720.936 rows=4 loops=1)
   Sort Key: l_returnflag, l_linestatus
   Sort Method: quicksort  Memory: 25kB
   ->  HashAggregate  (cost=436342.49..436342.60 rows=6 width=36) (actual time=18720.887..18720.892 rows=4 loops=1)
         ->  Seq Scan on lineitem_1  (cost=0.00..199645.01 rows=5917437 width=36) (actual time=0.011..6754.619 rows=5916591 loops=1)
               Filter: (l_shipdate <= '1998-09-02 00:00:00'::timestamp without time zone)
               Rows Removed by Filter: 84624
 Total runtime: 18721.021 ms
(8 rows)


Here are the results of "vmstat 1" while running 8 parallel TPC-H Simple (#6) queries. Although there is no need for I/O, "wa" fluctuates between 0 and 1.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 30093568  84892 38723896    0    0     0     0   22   14  0  0 100  0  0
 8  1      0 30043056  84892 38723896    0    0     0     0 27080 52708 16 14 70  0  0
 8  1      0 30006600  84892 38723896    0    0     0     0 44952 118286 43 44 12  1  0
 8  0      0 29986264  84900 38723896    0    0     0    20 28043 95934 49 42  8  1  0
 7  0      0 29991976  84900 38723896    0    0     0     0 8308 73641 52 42  6  0  0
 0  0      0 30091828  84900 38723896    0    0     0     0 3996 30978 23 24 53  0  0
 0  0      0 30091968  84900 38723896    0    0     0     0   17   23  0  0 100  0  0

I installed PostgreSQL 9.3.1 from source, and in the postgres configuration file I only changed shared_buffers (to 4 GB) and work_mem (to 32 MB).

Re: Parallel Select query performance and shared buffers

From: Claudio Freire

On Wed, Dec 4, 2013 at 9:19 AM, Metin Doslu <metin@citusdata.com> wrote:
>
> Here are the results of "vmstat 1" while running 8 parallel TPC-H Simple
> (#6) queries:  Although there is no need for I/O, "wa" fluctuates between 0
> and 1.
>
> procs -----------memory---------- ---swap-- -----io---- --system--
> -----cpu-----
>  r  b   swpd   free   buff    cache     si   so    bi    bo    in     cs us sy  id wa st
>  0  0      0 30093568  84892 38723896    0    0     0     0    22     14  0  0 100  0  0
>  8  1      0 30043056  84892 38723896    0    0     0     0 27080  52708 16 14  70  0  0
>  8  1      0 30006600  84892 38723896    0    0     0     0 44952 118286 43 44  12  1  0
>  8  0      0 29986264  84900 38723896    0    0     0    20 28043  95934 49 42   8  1  0
>  7  0      0 29991976  84900 38723896    0    0     0     0  8308  73641 52 42   6  0  0
>  0  0      0 30091828  84900 38723896    0    0     0     0  3996  30978 23 24  53  0  0
>  0  0      0 30091968  84900 38723896    0    0     0     0    17    23   0  0 100  0  0


Notice the huge %sy

What kind of VM are you using? HVM or paravirtual?


Re: Parallel Select query performance and shared buffers

From: Andres Freund

On 2013-12-04 14:27:10 -0200, Claudio Freire wrote:
> On Wed, Dec 4, 2013 at 9:19 AM, Metin Doslu <metin@citusdata.com> wrote:
> >
> > Here are the results of "vmstat 1" while running 8 parallel TPC-H Simple
> > (#6) queries:  Although there is no need for I/O, "wa" fluctuates between 0
> > and 1.
> >
> > procs -----------memory---------- ---swap-- -----io---- --system--
> > -----cpu-----
> >  r  b   swpd   free   buff    cache     si   so    bi    bo    in     cs us sy  id wa st
> >  0  0      0 30093568  84892 38723896    0    0     0     0    22     14  0  0 100  0  0
> >  8  1      0 30043056  84892 38723896    0    0     0     0 27080  52708 16 14  70  0  0
> >  8  1      0 30006600  84892 38723896    0    0     0     0 44952 118286 43 44  12  1  0
> >  8  0      0 29986264  84900 38723896    0    0     0    20 28043  95934 49 42   8  1  0
> >  7  0      0 29991976  84900 38723896    0    0     0     0  8308  73641 52 42   6  0  0
> >  0  0      0 30091828  84900 38723896    0    0     0     0  3996  30978 23 24  53  0  0
> >  0  0      0 30091968  84900 38723896    0    0     0     0    17    23   0  0 100  0  0
>
>
> Notice the huge %sy

My bet is on transparent hugepage defragmentation. Alternatively it's
scheduler overhead, due to superfluous context switches around the
buffer mapping locks.

I'd strongly suggest doing a "perf record -g -a <wait a bit, ctrl-c>;
perf report" run to check what's eating up the time.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Parallel Select query performance and shared buffers

From: Metin Doslu

>Notice the huge %sy
>What kind of VM are you using? HVM or paravirtual?

This instance is paravirtual.

Re: Parallel Select query performance and shared buffers

From: Metin Doslu

> I'd strongly suggest doing a "perf record -g -a <wait a bit, ctrl-c>;
> perf report" run to check what's eating up the time.

Here is one example:

+  38.87%   swapper  [kernel.kallsyms]   [k] hypercall_page
+   9.32%  postgres  [kernel.kallsyms]   [k] hypercall_page
+   6.80%  postgres  [kernel.kallsyms]   [k] xen_set_pte_at
+   5.83%  postgres  [kernel.kallsyms]   [k] copy_user_generic_string
+   2.06%  postgres  [kernel.kallsyms]   [k] file_read_actor
+   1.89%  postgres  postgres            [.] heapgettup_pagemode
+   1.83%  postgres  postgres            [.] hash_search_with_hash_value
+   1.33%  postgres  [kernel.kallsyms]   [k] get_phys_to_machine
+   1.25%  postgres  [kernel.kallsyms]   [k] find_get_page
+   1.00%  postgres  postgres            [.] heapgetpage
+   0.99%  postgres  [kernel.kallsyms]   [k] radix_tree_lookup_element
+   0.98%  postgres  postgres            [.] advance_aggregates
+   0.96%  postgres  postgres            [.] ExecProject
+   0.94%  postgres  postgres            [.] advance_transition_function
+   0.88%  postgres  postgres            [.] ExecScan
+   0.87%  postgres  postgres            [.] HeapTupleSatisfiesMVCC
+   0.86%  postgres  postgres            [.] LWLockAcquire
+   0.82%  postgres  [kernel.kallsyms]   [k] put_page
+   0.82%  postgres  postgres            [.] MemoryContextReset
+   0.80%  postgres  postgres            [.] SeqNext
+   0.78%  postgres  [kernel.kallsyms]   [k] pte_mfn_to_pfn
+   0.69%  postgres  postgres            [.] ExecClearTuple
+   0.57%  postgres  postgres            [.] ExecProcNode
+   0.54%  postgres  postgres            [.] heap_getnext
+   0.53%  postgres  postgres            [.] LWLockRelease
+   0.53%  postgres  postgres            [.] ExecStoreTuple
+   0.51%  postgres  libc-2.12.so        [.] __GI___libc_read
+   0.42%  postgres  [kernel.kallsyms]   [k] xen_spin_lock
+   0.40%  postgres  postgres            [.] ReadBuffer_common
+   0.38%  postgres  [kernel.kallsyms]   [k] __do_fault
+   0.37%  postgres  [kernel.kallsyms]   [k] shmem_fault
+   0.37%  postgres  [kernel.kallsyms]   [k] unmap_single_vma
+   0.35%  postgres  [kernel.kallsyms]   [k] __wake_up_bit
+   0.33%  postgres  postgres            [.] StrategyGetBuffer
+   0.33%  postgres  [kernel.kallsyms]   [k] set_page_dirty
+   0.33%  postgres  [kernel.kallsyms]   [k] handle_pte_fault
+   0.33%  postgres  postgres            [.] ExecAgg
+   0.31%  postgres  postgres            [.] XidInMVCCSnapshot
+   0.31%  postgres  [kernel.kallsyms]   [k] __audit_syscall_entry
+   0.31%  postgres  postgres            [.] CheckForSerializableConflictOut
+   0.29%  postgres  [kernel.kallsyms]   [k] handle_mm_fault
+   0.25%  postgres  [kernel.kallsyms]   [k] shmem_getpage_gfp



Re: Parallel Select query performance and shared buffers

From: Andres Freund

On 2013-12-04 18:43:35 +0200, Metin Doslu wrote:
> > I'd strongly suggest doing a "perf record -g -a <wait a bit, ctrl-c>;
> > perf report" run to check what's eating up the time.
>
> Here is one example:
>
> +  38.87%   swapper  [kernel.kallsyms]   [k] hypercall_page
> +   9.32%  postgres  [kernel.kallsyms]   [k] hypercall_page
> +   6.80%  postgres  [kernel.kallsyms]   [k] xen_set_pte_at

All that time is spent in your virtualization solution. One thing to
try is to look on the host system; sometimes profiles there can be more
meaningful.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Parallel Select query performance and shared buffers

From: Claudio Freire

On Wed, Dec 4, 2013 at 1:54 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-12-04 18:43:35 +0200, Metin Doslu wrote:
>> > I'd strongly suggest doing a "perf record -g -a <wait a bit, ctrl-c>;
>> > perf report" run to check what's eating up the time.
>>
>> Here is one example:
>>
>> +  38.87%   swapper  [kernel.kallsyms]   [k] hypercall_page
>> +   9.32%  postgres  [kernel.kallsyms]   [k] hypercall_page
>> +   6.80%  postgres  [kernel.kallsyms]   [k] xen_set_pte_at
>
> All that time is spent in your virtualization solution. One thing to try
> is to look on the host system, sometimes profiles there can be more
> meaningful.

You cannot profile the host on EC2.

You could try HVM. I've noticed it fares better under heavy CPU load,
and it's not fully HVM (it still uses paravirtualized network and
I/O).


Re: Parallel Select query performance and shared buffers

From: Metin Doslu

> You could try HVM. I've noticed it fare better  under heavy CPU load,
> and it's not fully-HVM (it still uses paravirtualized network and
> I/O).

I already tried HVM (a cc2.8xlarge instance on Amazon EC2) and observed the same problem.

Re: Parallel Select query performance and shared buffers

From: Andres Freund

On 2013-12-04 16:00:40 -0200, Claudio Freire wrote:
> On Wed, Dec 4, 2013 at 1:54 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > All that time is spent in your virtualization solution. One thing to try
> > is to look on the host system, sometimes profiles there can be more
> > meaningful.
>
> You cannot profile the host on EC2.

Didn't follow the thread from the start. So, this is EC2? Have you
checked, with a recent enough version of top or whatever, how much time
is reported as "stolen"?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Parallel Select query performance and shared buffers

From: Metin Doslu

> Didn't follow the thread from the start. So, this is EC2? Have you
> checked, with a recent enough version of top or whatever, how much time
> is reported as "stolen"?

Yes, this is EC2. "stolen" is occasionally reported as 1, but mostly as 0.

Re: Parallel Select query performance and shared buffers

From: Metin Doslu

Here is some extra information:

- When we increased NUM_BUFFER_PARTITIONS to 1024, this problem disappeared for 8-core machines and came back with 16-core machines on Amazon EC2. Could it be related to PostgreSQL's locking mechanism?

- I tried this test with 4-core machines, including my personal computer and some other instances on Amazon EC2, and I didn't see this problem with 4-core machines. I started to see this problem in PostgreSQL when the core count is 8 or more.

Here are the results of "vmstat 1" while running 8 parallel select count(*) queries. Normally I would expect zero idle time.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 29838640  94000 38954740    0    0     0     0   22   21  0  0 100  0  0
 7  2      0 29788416  94000 38954740    0    0     0     0 53922 108490 14 24 60  1  1
 5  0      0 29747248  94000 38954740    0    0     0     0 68008 164571 22 48 27  2  1
 8  0      0 29725796  94000 38954740    0    0     0     0 43587 150574 28 54 16  1  1
 0  0      0 29838328  94000 38954740    0    0     0     0 15584 100459 26 55 18  1  0
 0  0      0 29838328  94000 38954740    0    0     0     0   42   15  0  0 100  0  0

- When I run 8 parallel wc commands or other scripts, they scale out as expected and utilize all of the CPUs. This leads me to think that the problem is related to PostgreSQL rather than the OS.

Re: Parallel Select query performance and shared buffers

From: Andres Freund

On 2013-12-04 20:19:55 +0200, Metin Doslu wrote:
> - When we increased NUM_BUFFER_PARTITIONS to 1024, this problem is
> disappeared for 8 core machines and come back with 16 core machines on
> Amazon EC2. Would it be related with PostgreSQL locking mechanism?

You could try my lwlock-scalability improvement patches - for some
workloads here, the improvements have been rather noticeable. Which
version are you testing?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Parallel Select query performance and shared buffers

From: Metin Doslu

> You could try my lwlock-scalability improvement patches - for some
> workloads here, the improvements have been rather noticeable. Which
> version are you testing?

I'm testing with PostgreSQL 9.3.1.

Re: Parallel Select query performance and shared buffers

From: Amit Kapila

On Wed, Dec 4, 2013 at 10:40 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Wed, Dec 4, 2013 at 12:57 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> As a quick side, we also repeated the same experiment on an EC2 instance
>>> with 16 CPU cores, and found that the scale out behavior became worse there.
>>> (We also tried increasing the shared_buffers to 30 GB. This change
>>> completely solved the scaling out problem on this instance type, but hurt
>>> our performance on the hi1.4xlarge instances.)
>>
>> Instead of 30GB, you can try with lesser value, but it should be close
>> to your data size.
>
> The OS cache should have provided a similar function.
The performance cannot be the same as when those pages are in shared
buffers, because:
a. the OS can flush those pages
b. in any case, loading them into shared buffers again has some overhead.

> In fact, larger shared buffers shouldn't have made a difference if the
> main I/O pattern are sequential scans, because they use a ring buffer.
Yeah, this is right, but then why is he able to see scaling when he
increased shared_buffers to a larger value?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel Select query performance and shared buffers

From: Amit Kapila

On Wed, Dec 4, 2013 at 11:49 PM, Metin Doslu <metin@citusdata.com> wrote:
> Here are some extra information:
>
> - When we increased NUM_BUFFER_PARTITIONS to 1024, this problem is
> disappeared for 8 core machines and come back with 16 core machines on
> Amazon EC2. Would it be related with PostgreSQL locking mechanism?

I think there is a good chance of improvement here with the patch
suggested by Andres in this thread, but I still think it might not
completely resolve the current problem, as there will be overhead in
associating data with shared buffers.

Currently NUM_BUFFER_PARTITIONS is fixed, so maybe auto-tuning it based
on some parameters could help in such situations.
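
To illustrate why the partition count matters, here is a simplified
sketch (made-up names, not the actual PostgreSQL source) of how each
buffer lookup maps to one of the NUM_BUFFER_PARTITIONS locks:

#define NUM_BUFFER_PARTITIONS 16        /* the default; you raised it to 1024 */

typedef struct BufferTagSketch
{
    unsigned int rel_node;              /* which relation */
    unsigned int block_num;             /* which block within it */
} BufferTagSketch;

static unsigned int
buftag_hash(const BufferTagSketch *tag)
{
    /* any reasonable hash of the tag */
    return tag->rel_node * 2654435761u ^ tag->block_num;
}

static int
buf_mapping_partition(const BufferTagSketch *tag)
{
    /* Every backend whose lookups hash into the same partition contends
     * on the same lwlock, so with only 16 partitions and 8 or more
     * concurrent scans filling shared buffers, those locks get hot. */
    return buftag_hash(tag) % NUM_BUFFER_PARTITIONS;
}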

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel Select query performance and shared buffers

From: Metin Doslu

> - When we increased NUM_BUFFER_PARTITIONS to 1024, this problem is
> disappeared for 8 core machines and come back with 16 core machines on
> Amazon EC2. Would it be related with PostgreSQL locking mechanism?

If we build with -DLWLOCK_STATS to print locking stats from PostgreSQL, we see tons of contention with the default value of NUM_BUFFER_PARTITIONS, which is 16:

$ tail -f /tmp/logfile | grep lwlock | egrep -v "blk 0"
...
PID 15965 lwlock 0: shacq 0 exacq 33 blk 2
PID 15965 lwlock 34: shacq 14010 exacq 27134 blk 6192
PID 15965 lwlock 35: shacq 14159 exacq 27397 blk 5426
PID 15965 lwlock 36: shacq 14111 exacq 27322 blk 4959
PID 15965 lwlock 37: shacq 14211 exacq 27507 blk 4370
PID 15965 lwlock 38: shacq 14110 exacq 27294 blk 3980
PID 15965 lwlock 39: shacq 13962 exacq 27027 blk 3719
PID 15965 lwlock 40: shacq 14023 exacq 27156 blk 3273
PID 15965 lwlock 41: shacq 14107 exacq 27309 blk 3201
PID 15965 lwlock 42: shacq 14120 exacq 27304 blk 2904
PID 15965 lwlock 43: shacq 14007 exacq 27129 blk 2740
PID 15965 lwlock 44: shacq 13948 exacq 27027 blk 2616
PID 15965 lwlock 45: shacq 14041 exacq 27198 blk 2431
PID 15965 lwlock 46: shacq 14067 exacq 27277 blk 2345
PID 15965 lwlock 47: shacq 14050 exacq 27203 blk 2106
PID 15965 lwlock 48: shacq 13910 exacq 26910 blk 2155
PID 15965 lwlock 49: shacq 14170 exacq 27360 blk 1989

After we increased NUM_BUFFER_PARTITIONS to 1024, the lock contention decreased:
...
PID 25220 lwlock 1000: shacq 247 exacq 494 blk 1
PID 25220 lwlock 1001: shacq 198 exacq 394 blk 1
PID 25220 lwlock 1002: shacq 203 exacq 404 blk 1
PID 25220 lwlock 1003: shacq 226 exacq 452 blk 1
PID 25220 lwlock 1004: shacq 235 exacq 470 blk 1
PID 25220 lwlock 1006: shacq 226 exacq 452 blk 2
PID 25220 lwlock 1007: shacq 214 exacq 428 blk 1
PID 25220 lwlock 1008: shacq 225 exacq 448 blk 1
PID 25220 lwlock 1010: shacq 209 exacq 418 blk 1
PID 25220 lwlock 1015: shacq 199 exacq 398 blk 1
PID 25220 lwlock 1016: shacq 214 exacq 426 blk 1
PID 25220 lwlock 1018: shacq 230 exacq 456 blk 1
PID 25220 lwlock 1019: shacq 222 exacq 444 blk 3
PID 25220 lwlock 1023: shacq 262 exacq 524 blk 1
PID 25220 lwlock 1027: shacq 213 exacq 426 blk 1
PID 25220 lwlock 1028: shacq 246 exacq 491 blk 1
PID 25220 lwlock 1029: shacq 226 exacq 452 blk 1

Re: Parallel Select query performance and shared buffers

From: Andres Freund

On 2013-12-05 11:15:20 +0200, Metin Doslu wrote:
> > - When we increased NUM_BUFFER_PARTITIONS to 1024, this problem is
> > disappeared for 8 core machines and come back with 16 core machines on
> > Amazon EC2. Would it be related with PostgreSQL locking mechanism?
>
> If we build with -DLWLOCK_STATS to print locking stats from PostgreSQL, we
> see tons of contention with default value of NUM_BUFFER_PARTITIONS which is
> 16:

Is your workload bigger than RAM? I think a good bit of the contention
you're seeing in that listing is from populating shared_buffers - and it
might actually vanish once you're halfway cached.

From what I've seen so far, the bigger problem is not contention on the
lwlocks themselves, but the spinlock protecting the lwlocks...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Parallel Select query performance and shared buffers

From: Metin Doslu

> Is your workload bigger than RAM?

RAM is bigger than the workload (by more than a couple of times).

> I think a good bit of the contention
> you're seeing in that listing is populating shared_buffers - and might
> actually vanish once you're halfway cached.
> From what I've seen so far the bigger problem than contention in the
> lwlocks itself, is the spinlock protecting the lwlocks...

Could you clarify a bit what you mean by "halfway cached" and "the spinlock protecting the lwlocks"?

Re: Parallel Select query performance and shared buffers

From: Andres Freund

On 2013-12-05 11:33:29 +0200, Metin Doslu wrote:
> > Is your workload bigger than RAM?
>
> RAM is bigger than workload (more than a couple of times).

> > I think a good bit of the contention
> > you're seeing in that listing is populating shared_buffers - and might
> > actually vanish once you're halfway cached.
> > From what I've seen so far the bigger problem than contention in the
> > lwlocks itself, is the spinlock protecting the lwlocks...
>
> Could you clarify a bit what do you mean by "halfway cached"

Well, your stats showed a) fairly low lock counts overall and b) a high
percentage of exclusive locks.
a) indicates the system wasn't running long.
b) tells me there were lots of changes to the buffer mapping - which
   basically only happens when a buffer is placed in or removed from
   shared_buffers.

If your shared_buffers is big enough to contain most of the data, you
shouldn't see many exclusive locks in comparison to the number of shared
locks.

> and "spinlock protecting the lwlocks".

Every LWLock has an internal spinlock to protect its state. So whenever
somebody does an LWLockAcquire()/Release(), even if only in shared mode,
we currently acquire that spinlock, manipulate the LWLock's state, and
release the spinlock again. In lots of workloads that internal spinlock
is the contention point, not the length of time the lwlock is held,
especially when the locks are mostly held in shared mode.
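
In pseudo-C, the shared acquire path looks roughly like this (a
simplified sketch, not the actual source; the spinlock primitives are
stand-ins built on GCC atomics):

typedef struct LWLockSketch
{
    volatile int mutex;                 /* the internal spinlock */
    int          exclusive;             /* nonzero if held exclusively */
    int          shared;                /* number of shared holders */
} LWLockSketch;

static void
spin_lock(volatile int *mutex)
{
    while (__sync_lock_test_and_set(mutex, 1))
        ;                               /* busy-wait: the contended part */
}

static void
spin_unlock(volatile int *mutex)
{
    __sync_lock_release(mutex);
}

static int
lwlock_acquire_shared(LWLockSketch *lock)
{
    int ok;

    spin_lock(&lock->mutex);            /* even shared acquirers all funnel
                                         * through this one spinlock */
    ok = !lock->exclusive;
    if (ok)
        lock->shared++;                 /* the state change itself is cheap */
    spin_unlock(&lock->mutex);

    return ok;                          /* on failure the caller would queue */
}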

Makes sense?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Parallel Select query performance and shared buffers

From: Metin Doslu

> You could try my lwlock-scalability improvement patches - for some
> workloads here, the improvements have been rather noticeable. Which
> version are you testing?

I tried your patches (from the link). As you suspected, I didn't see any improvements. I tested them on PostgreSQL 9.2 stable.


Re: Parallel Select query performance and shared buffers

From: Andres Freund

On 2013-12-05 17:46:44 +0200, Metin Doslu wrote:
> I tried your patches on next link. As you suspect I didn't see any
> improvements. I tested it on PostgreSQL 9.2 Stable.

You tested the correct branch, right? Which commit does "git rev-parse
HEAD" show?

But generally, as long as your profile hides all the important
information behind the hypervisor's cost, you're going to have a hard
time analyzing the problems. You really should try to reproduce the
problems on native hardware (as similar to the host hardware as
possible), to get accurate data. On CPU bound workloads that information
is often transportable to the virtual world.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Parallel Select query performance and shared buffers

From: Metin Doslu

> You tested the correct branch, right? Which commit does "git rev-parse
> HEAD" show?

I applied the last two patches manually on top of PostgreSQL 9.2 stable.

Re: Parallel Select query performance and shared buffers

From
Metin Doslu
Date:
> From what I've seen so far the bigger problem than contention in the
> lwlocks itself, is the spinlock protecting the lwlocks...

Postgres 9.3.1 also reports spindelay; it seems that there is no contention on the spinlocks.

PID 21121 lwlock 0: shacq 0 exacq 33 blk 1 spindelay 0
PID 21121 lwlock 33: shacq 7602 exacq 14688 blk 4381 spindelay 0
PID 21121 lwlock 34: shacq 7826 exacq 15113 blk 3786 spindelay 0
PID 21121 lwlock 35: shacq 7792 exacq 15110 blk 3356 spindelay 0
PID 21121 lwlock 36: shacq 7803 exacq 15125 blk 3075 spindelay 0
PID 21121 lwlock 37: shacq 7822 exacq 15177 blk 2756 spindelay 0
PID 21121 lwlock 38: shacq 7694 exacq 14863 blk 2513 spindelay 0
PID 21121 lwlock 39: shacq 7914 exacq 15320 blk 2400 spindelay 0
PID 21121 lwlock 40: shacq 7855 exacq 15203 blk 2220 spindelay 0
PID 21121 lwlock 41: shacq 7942 exacq 15363 blk 1996 spindelay 0
PID 21121 lwlock 42: shacq 7828 exacq 15115 blk 1872 spindelay 0
PID 21121 lwlock 43: shacq 7820 exacq 15159 blk 1833 spindelay 0
PID 21121 lwlock 44: shacq 7709 exacq 14916 blk 1590 spindelay 0
PID 21121 lwlock 45: shacq 7831 exacq 15134 blk 1619 spindelay 0
PID 21121 lwlock 46: shacq 7744 exacq 14989 blk 1559 spindelay 0
PID 21121 lwlock 47: shacq 7808 exacq 15111 blk 1473 spindelay 0
PID 21121 lwlock 48: shacq 7729 exacq 14929 blk 1381 spindelay 0

Re: Parallel Select query performance and shared buffers

From: Claudio Freire

On Thu, Dec 5, 2013 at 1:03 PM, Metin Doslu <metin@citusdata.com> wrote:
>> From what I've seen so far the bigger problem than contention in the
>> lwlocks itself, is the spinlock protecting the lwlocks...
>
> Postgres 9.3.1 also reports spindelay, it seems that there is no contention
> on spinlocks.


Did you check hugepages?