Thread: How to make partitioning scale better for larger numbers of partitions

How to make partitioning scale better for larger numbers of partitions

From
"Kato, Sho"
Date:

Hi,

 

I benchmarked on a RANGE partitioned table with 1.1k leaf partitions and no sub-partitioned tables.

But statement latencies on a partitioned table are much higher than on a non-partitioned table.

 

UPDATE latency is 210 times higher than on a non-partitioned table.

SELECT latency is 36 times higher than on a non-partitioned table.

Surprisingly, INSERT latency is almost the same.

 

Of course, I'm sure table partitioning works well with up to a hundred partitions, as stated in the PostgreSQL documentation.

But my customer will use a partitioned table with 1.1k leaf partitions.

So, we need to improve performance.

 

Any ideas?

 

The results of pgbench and perf are listed below.
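
For reference, the three custom script files can be reconstructed from the per-statement latencies below; the \set lines and SQL statements are taken from the reports, but the exact files are not attached, so treat this as a sketch:

  -- update.sql
  \set aid random(1, 1100 * 1)
  \set delta random(-5000, 5000)
  BEGIN;
  UPDATE test.accounts SET abalance = abalance + :delta WHERE aid = :aid;
  END;

  -- select.sql
  \set aid random(1, 1100 * 1)
  \set delta random(-5000, 5000)
  BEGIN;
  SELECT abalance FROM test.accounts WHERE aid = :aid;
  END;

  -- insert.sql
  \set aid random(1, 1100 * 1)
  \set delta random(-5000, 5000)
  BEGIN;
  INSERT INTO test.accounts_history (aid, delta, mtime) VALUES (:aid, :delta, CURRENT_TIMESTAMP);
  END;

Each script appears to have been run single-client for 180 seconds with per-statement reporting, i.e. something like "pgbench -n -c 1 -j 1 -T 180 -r -f update.sql" (the -M prepared variant comes up later in the thread).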

 

pgbench results

---------------

 

transaction type: update.sql

scaling factor: 1

query mode: simple

number of clients: 1

number of threads: 1

duration: 180 s

number of transactions actually processed: 648

latency average = 278.202 ms

tps = 3.594512 (including connections establishing)

tps = 3.594545 (excluding connections establishing)

statement latencies in milliseconds:

         0.011  \set aid random(1, 1100 * 1)

         0.004  \set delta random(-5000, 5000)

         0.038  BEGIN;

       277.005  UPDATE test.accounts SET abalance = abalance + :delta WHERE aid = :aid;

         1.140  END;

 

transaction type: select.sql

scaling factor: 1

query mode: simple

number of clients: 1

number of threads: 1

duration: 180 s

number of transactions actually processed: 19415

latency average = 9.281 ms

tps = 107.745068 (including connections establishing)

tps = 107.746067 (excluding connections establishing)

statement latencies in milliseconds:

         0.800  \set aid random(1, 1100 * 1)

         0.137  \set delta random(-5000, 5000)

         1.351  BEGIN;

         4.941  SELECT abalance FROM test.accounts WHERE aid = :aid;

         2.052  END;

 

transaction type: insert.sql

scaling factor: 1

query mode: simple

number of clients: 1

number of threads: 1

duration: 180 s

number of transactions actually processed: 31895

latency average = 5.654 ms

tps = 176.865541 (including connections establishing)

tps = 176.867086 (excluding connections establishing)

statement latencies in milliseconds:

         2.083  \set aid random(1, 1100 * 1)

         0.003  \set delta random(-5000, 5000)

         0.029  BEGIN;

         3.222  INSERT INTO test.accounts_history (aid, delta, mtime) VALUES (:aid, :delta, CURRENT_TIMESTAMP);

         0.317  END;

 

Top 10 of perf report

---------------------

 

UPDATE:

21.33%  postgres  postgres           [.] range_table_mutator

12.57%  postgres  postgres           [.] AllocSetAlloc

  4.97%  postgres  postgres           [.] palloc

  4.48%  postgres  postgres           [.] make_one_rel

  3.96%  postgres  postgres           [.] lappend

  2.74%  postgres  [kernel.kallsyms]  [k] get_page_from_freelist

  1.87%  postgres  postgres           [.] setup_append_rel_array

  1.68%  postgres  [kernel.kallsyms]  [k] list_del

  1.64%  postgres  [kernel.kallsyms]  [k] __alloc_pages_nodemask

  1.62%  postgres  [kernel.kallsyms]  [k] unmap_vmas

 

SELECT:

14.72%  postgres  postgres           [.] AllocSetAlloc

  5.14%  postgres  postgres           [.] hash_search_with_hash_value

  4.23%  postgres  postgres           [.] palloc

  4.06%  postgres  postgres           [.] MemoryContextAllocZeroAligned

  2.61%  postgres  postgres           [.] copyObjectImpl

  2.34%  postgres  postgres           [.] expression_tree_mutator

  2.13%  postgres  [kernel.kallsyms]  [k] _spin_lock

  1.91%  postgres  postgres           [.] lappend

  1.59%  postgres  [kernel.kallsyms]  [k] __link_path_walk

  1.50%  postgres  postgres           [.] set_rel_size

 

INSERT:

20.75%  postgres  postgres           [.] hash_search_with_hash_value

  6.03%  postgres  postgres           [.] hash_any

  4.88%  postgres  postgres           [.] AllocSetAlloc

  4.05%  postgres  postgres           [.] LWLockRelease

  4.05%  postgres  postgres           [.] LWLockAcquire

  3.27%  postgres  postgres           [.] oid_cmp

  3.06%  postgres  postgres           [.] SearchCatCache

  2.97%  postgres  postgres           [.] LockReleaseAll

  2.57%  postgres  postgres           [.] pg_qsort

  2.37%  postgres  postgres           [.] hash_seq_search

 

 

The following is information on the environment used for the benchmark.

 

Server spec

-----------

 

  The server has 16 CPUs.

  Memory size is 264GB.

  Database directory is on SSD.

 

database tuning

---------------

 

  shared_buffers = 102GB

  max_locks_per_transaction = 1000000

 

postgresql version

------------------

 

  11beta2 + patch1 + patch2

 

  patch1: Allow direct lookups of AppendRelInfo by child relid

          commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687

 

  patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch

          https://commitfest.postgresql.org/18/1690/

 

 

table definition

----------------

 

  create table test.accounts(aid serial, abalance int) partition by range(aid);

  create table test.accounts_history(id serial, aid int, delta int, mtime timestamp without time zone) partition by range(aid);

 

  create table test.account_part_1 partition of test.accounts for values from (1) to (2);

  create table test.account_part_2 partition of test.accounts for values from (2) to (3);

  .

  .

  create table test.account_part_1100 partition of test.accounts for values from (1100) to (1101);

 

  accounts_history is also partitioned in the same way.

 

  There is only one row in each leaf partition for the UPDATE/SELECT benchmarks.
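
  A setup like this can be generated with a DO block along the following lines. This is only a sketch, not the exact script that was used, and the partition names for accounts_history are an assumption (the thread only says it is partitioned the same way):

  DO $$
  BEGIN
    FOR i IN 1..1100 LOOP
      EXECUTE format('CREATE TABLE test.account_part_%s PARTITION OF test.accounts '
                     'FOR VALUES FROM (%s) TO (%s)', i, i, i + 1);
      -- accounts_history partition names below are hypothetical
      EXECUTE format('CREATE TABLE test.account_history_part_%s PARTITION OF test.accounts_history '
                     'FOR VALUES FROM (%s) TO (%s)', i, i, i + 1);
    END LOOP;
  END $$;

  -- one row per leaf partition for the UPDATE/SELECT benchmarks
  INSERT INTO test.accounts (aid, abalance)
  SELECT i, 0 FROM generate_series(1, 1100) AS i;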

 

  regards,

RE: How to make partitioning scale better for larger numbers of partitions

From
"Kato, Sho"
Date:
>I wondered if you compared to PG10 or to inheritance-partitioning (parent with relkind='r' and either trigger or rule
>or INSERT/UPDATE directly into child)?
 

Thank you for your reply.

I compared to PG11beta2 with a non-partitioned table.

The non-partitioned table has 1100 records in one table.
The partitioned table has one record in each leaf partition.

Regards,
-----Original Message-----
From: Justin Pryzby [mailto:pryzby@telsasoft.com] 
Sent: Friday, July 13, 2018 12:11 PM
To: Kato, Sho/加藤 翔 <kato-sho@jp.fujitsu.com>
Subject: Re: How to make partitioning scale better for larger numbers of partitions

Hi,

On Fri, Jul 13, 2018 at 02:58:53AM +0000, Kato, Sho wrote:
> I benchmarked on a RANGE partitioned table with 1.1k leaf partitions and no sub-partitioned tables.
> But, statement latencies on a partitioned table is much slower than on a non-partitioned table.
> 
> UPDATE latency is 210 times slower than a non-partitioned table.
> SELECT latency is 36 times slower than a non-partitioned table.
> Surprisingly INSERT latency is almost same.

I wondered if you compared to PG10 or to inheritance-partitioning (parent with relkind='r' and either trigger or rule
or INSERT/UPDATE directly into child)?
 

Justin





Re: How to make partitioning scale better for larger numbers of partitions

From
Amit Langote
Date:
Kato-san,

On 2018/07/13 11:58, Kato, Sho wrote:
> Hi,
> 
> I benchmarked on a RANGE partitioned table with 1.1k leaf partitions and no sub-partitioned tables.

Thanks for sharing the results.

> But, statement latencies on a partitioned table is much slower than on a non-partitioned table.
> 
> UPDATE latency is 210 times slower than a non-partitioned table.
> SELECT latency is 36 times slower than a non-partitioned table.
> Surprisingly INSERT latency is almost same.

Yes, INSERT comes out ahead because there is no overhead of partitioning
in the planning phase.  As David Rowley reported on the nearby thread
("Speeding up INSERTs and UPDATEs to partitioned tables"), there is still
significant overhead during its execution, so we're still a fair bit
away from the best possible performance.

For SELECT/UPDATE/DELETE, overhead of partitioning in the planning phase
is pretty significant and gets worse as the number of partitions grows.  I
had intended to fix that in PG 11, but we could only manage to get part of
that work into PG 11, with significant help from David and others.  So,
while PG 11's overhead of partitioning during planning is less than PG
10's due to improved pruning algorithm, it's still pretty far from ideal,
because it isn't just the pruning algorithm that had overheads.  In fact,
PG 11 only removes the pruning overhead for SELECT, so UPDATE/DELETE still
carry the overhead that was in PG 10.  The overheads I mention stem from
the fact that for partitioning we still rely on the old planner code
that's used to perform inheritance planning, which requires to lock and
open *all* partitions.  We have so far been able to refactor just enough
to use the new code for partition pruning, but there is much refactoring
work left to avoid needlessly locking and opening all partitions.

Thanks,
Amit



RE: How to make partitioning scale better for larger numbers of partitions

From
"Tsunakawa, Takayuki"
Date:
From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp]
> For SELECT/UPDATE/DELETE, overhead of partitioning in the planning phase
> is pretty significant and gets worse as the number of partitions grows.
> I
> had intended to fix that in PG 11, but we could only manage to get part
> of
> that work into PG 11, with significant help from David and others.  So,
> while PG 11's overhead of partitioning during planning is less than PG
> 10's due to improved pruning algorithm, it's still pretty far from ideal,
> because it isn't just the pruning algorithm that had overheads.  In fact,
> PG 11 only removes the pruning overhead for SELECT, so UPDATE/DELETE still
> carry the overhead that was in PG 10.

David has submitted multiple patches for PG 12, one of which speeds up pruning of UPDATE/DELETE (I couldn't find it in
the current CF, though.)  What challenges are there for future versions, and which of them are being addressed by
patches in progress for PG 12, and which issues are untouched?
 

> The overheads I mention stem from
> the fact that for partitioning we still rely on the old planner code
> that's used to perform inheritance planning, which requires to lock and
> open *all* partitions.  We have so far been able to refactor just enough
> to use the new code for partition pruning, but there is much refactoring
> work left to avoid needlessly locking and opening all partitions.

I remember the issue of opening and locking all partitions was discussed before.  Does this relate only to the planning
phase?

Kato-san, could you try pgbench -M prepared?


Regards
Takayuki Tsunakawa





Re: How to make partitioning scale better for larger numbers of partitions

From
Justin Pryzby
Date:
On Fri, Jul 13, 2018 at 05:49:20AM +0000, Tsunakawa, Takayuki wrote:
> David has submitted multiple patches for PG 12, one of which speeds up pruning of UPDATE/DELETE (I couldn't find it
in the current CF, though.)  What challenges are there for future versions, and which of them are being addressed by
patches in progress for PG 12, and which issues are untouched?
 

Here's a known issue which I'm not sure is on anybody's list:

https://www.postgresql.org/message-id/flat/4F989CD8.8020804%40strategicdata.com.au#4F989CD8.8020804@strategicdata.com.au
=> planning of UPDATE/DELETE (but not SELECT) takes a huge amount of RAM when run
on a parent with many children.

We don't typically have UPDATEs, and those we do are against the child tables;
but we ran into this last month with a manual query on the parent causing OOM.  I
worked around it, but keep meaning to revisit it to see if this changed at all in
v11 (very brief testing suggests it has not).

Justin


Re: How to make partitioning scale better for larger numbers of partitions

From
David Rowley
Date:
On 13 July 2018 at 17:49, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> David has submitted multiple patches for PG 12, one of which speeds up pruning of UPDATE/DELETE (I couldn't find it
in the current CF, though.)  What challenges are there for future versions, and which of them are being addressed by
patches in progress for PG 12, and which issues are untouched?

I've not submitted that for PG12 yet. I had other ideas about just
getting rid of the inheritance planner altogether, but so far don't
have a patch for that. Still uncertain if there are any huge blockers
to that either. Join search needs to be performed multiple times, but a good
advantage will be that we can take advantage of partition pruning to
only join search the non-pruned partitions.

The reason I changed my mind about the patch you're talking about is
that it meant that RangeTblEntries needed to keep the same relation id
in the inheritance planner as they did in the grouping planner, and
another patch I have semi-baked delays building both RelOptInfo and
RangeTblEntry records for partitions until after pruning. Without the
RangeTblEntry it was difficult to ensure the relids were in lock-step
between the two planners.  There are ways to do it, it just didn't
look pretty.

Hoping to have something later in the cycle.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: How to make partitioning scale better for larger numbers of partitions

From
David Rowley
Date:
On 13 July 2018 at 14:58, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:
> Of course I'm sure table partitioning work well with up to a hundred
> partitions as written on the postgresql document.
>
> But, my customer will use partitioned table with 1.1k leaf partitions.
>
> So, we need to improve performance.
>
> Any ideas?

It would be really good if you could review and test my partitioning
patches in the current commitfest. These are the ones authored by me
with the word "partition" in the title.

These 4 patches don't solve all the problems, but they do make some
good gains in some areas. I've still more work to do, so the earlier I
can be finished with the 4 that are there now, the more I can focus on
the rest.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: How to make partitioning scale better for larger numbers of partitions

From
Amit Langote
Date:
On 2018/07/13 14:49, Tsunakawa, Takayuki wrote:
> From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp]
>> For SELECT/UPDATE/DELETE, overhead of partitioning in the planning phase
>> is pretty significant and gets worse as the number of partitions grows.
>> I
>> had intended to fix that in PG 11, but we could only manage to get part
>> of
>> that work into PG 11, with significant help from David and others.  So,
>> while PG 11's overhead of partitioning during planning is less than PG
>> 10's due to improved pruning algorithm, it's still pretty far from ideal,
>> because it isn't just the pruning algorithm that had overheads.  In fact,
>> PG 11 only removes the pruning overhead for SELECT, so UPDATE/DELETE still
>> carry the overhead that was in PG 10.
> 
> David has submitted multiple patches for PG 12, one of which speeds up pruning of UPDATE/DELETE (I couldn't find it
in the current CF, though.)
 

I don't think he has posted a new patch for improving the speed for
UPDATE/DELETE planning yet, although I remember he had posted a PoC patch
back in February or March.

> What challenges are there for future versions, and which of them are being addressed by patches in progress for PG
12, and which issues are untouched?
 

The immediate one I think is to refactor the planner such that the new
pruning code, that we were able to utilize for SELECT in PG 11, can also
be used for UPDATE/DELETE.  Refactoring needed to replace the pruning
algorithm was minimal for SELECT, but not so much for UPDATE/DELETE.

Once we're able to utilize the new pruning algorithm for all the cases, we
can move on to refactoring to avoid locking and opening of all partitions.
 As long as we're relying on constraint exclusion for partition pruning,
which we still do for UPDATE/DELETE, we cannot do that because constraint
exclusion has to look at each partition individually.

The UPDATE/DELETE planning for partitioning using huge amounts of memory and
CPU is a pretty big issue, and refactoring the planner to avoid that may be
the hardest of all the problems to be solved here.

>> The overheads I mention stem from
>> the fact that for partitioning we still rely on the old planner code
>> that's used to perform inheritance planning, which requires to lock and
>> open *all* partitions.  We have so far been able to refactor just enough
>> to use the new code for partition pruning, but there is much refactoring
>> work left to avoid needlessly locking and opening all partitions.
> 
> I remember the issue of opening and locking all partitions was discussed before.

Quite a few times I suppose.

> Does this relate only to the planning phase?

If the benchmark contains queries that will need to access just one
partition, then yes, the planning part is the biggest overhead.

Execution-time overhead is limited to having an extra, possibly needless,
Append node, but I know David has a patch for that too.

Thanks,
Amit



RE: How to make partitioning scale better for larger numbers of partitions

From
"Tsunakawa, Takayuki"
Date:
From: David Rowley [mailto:david.rowley@2ndquadrant.com]
> > David has submitted multiple patches for PG 12, one of which speeds up
> pruning of UPDATE/DELETE (I couldn't find it in the current CF, though.)
> What challenges are there for future versions, and which of them are being
> addressed by patches in progress for PG 12, and which issues are untouched?
> 
> I've not submitted that for PG12 yet. I had other ideas about just
> getting rid of the inheritance planner altogether, but so far don't
> have a patch for that. Still uncertain if there are any huge blockers
> to that either.

Sorry, I seem to have misunderstood something.

By the way, what do you think is the "ideal and should-be-feasible" goal and the "realistic" goal we can reach in the
near future (e.g. PG 12)?  Say,
 

* Planning and execution time is O(log n), where n is the number of partitions
* Planning time is O(log n), execution time is O(1)
* Planning and execution time is O(1), where n is the number of partitions

Regards
Takayuki Tsunakawa



Re: How to make partitioning scale better for larger numbers of partitions

From
David Rowley
Date:
On 13 July 2018 at 18:53, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> By the way, what do you think is the "ideal and should-be-feasible" goal and the "realistic" goal we can reach in the
near future (e.g. PG 12)?  Say,
 

Depends. Patches don't move that fast without review and nothing gets
committed without a committer.

> * Planning and execution time is O(log n), where n is the number of partitions
> * Planning time is O(log n), execution time is O(1)
> * Planning and execution time is O(1), where n is the number of partitions

It's going to depend on how many partitions are pruned. We still need
to generate paths for all non-pruned partitions which is going to be
slow when there are many partitions.

I think we can get pretty close to the non-partitioned planning
performance with SELECT/UPDATE/DELETE when all but 1 partition
survives pruning. There is always going to be some additional
overhead we can't get rid of, but hopefully it will be small.

Please feel free to review what I have in the July 'fest.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


RE: How to make partitioning scale better for larger numbers of partitions

From
"Tsunakawa, Takayuki"
Date:
From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp]
> The immediate one I think is to refactor the planner such that the new
> pruning code, that we were able to utilize for SELECT in PG 11, can also
> be used for UPDATE/DELETE.  Refactoring needed to replace the pruning
> algorithm was minimal for SELECT, but not so much for UPDATE/DELETE.
> 
> Once we're able to utilize the new pruning algorithm for all the cases,
> we
> can move on to refactoring to avoid locking and opening of all partitions.
>  As long as we're relying on constraint exclusion for partition pruning,
> which we still do for UPDATE/DELETE, we cannot do that because constraint
> exclusion has to look at each partition individually.
> 
> The UPDATE/DELETE planning for partitioning using huge memory and CPU is
> a
> pretty big issue and refactoring planner to avoid that may be what's
> hardest of all the problems to be solved here.

Thank you.  There seem to be many challenges to address...  As a user and PG developer, I'd be happy to see some wiki
page that describes the current performance characteristics in terms of # of partitions, the ideal and reachable
performance, and the challenges to overcome to reach that ideal goal.
 


> If the benchmark contains queries that will need to access just one
> partition, then yes the planning part has is the biggest overhead.
> 
> Execution-time overhead is limited to having an extra, possibly needless,
> Append node, but I know David has patch for that too.

That's good news, thanks.  Our user will perform around a hundred single-row INSERT/SELECT/UPDATE/DELETE statements in
each transaction, and those are PREPAREd.  I hope PG 11 (with David's patches) will work well for that workload.  I'll
wait for Kato-san's pgbench -M prepared result.
 

Regards
Takayuki Tsunakawa






RE: How to make partitioning scale better for larger numbers of partitions

From
"Kato, Sho"
Date:
Hi Amit,

Thank you for letting me know the challenges of partitioning.

>So, while PG 11's overhead of partitioning during planning is less than PG 10's due to improved pruning algorithm,
it's still pretty far from ideal, because it isn't just the pruning algorithm that had overheads.

That makes sense.
I also benchmarked PG10.
Actually, SELECT latency on PG11beta2 + patch1 is lower than on PG10.

SELECT latency with 800 leaf partitions
---------------------------------------
PG10 5.62 ms
PG11  3.869 ms

But even with PG11, a SELECT statement takes 21.102 ms in the benchmark with 1600 leaf partitions.
It takes a long time even though PG11's partition pruning algorithm is a binary search.

>The overheads I mention stem from the fact that for partitioning we still rely on the old planner code that's used to
perform inheritance planning, which requires to lock and open *all* partitions.

I debugged UPDATE statement execution on a partitioned table.
range_table_mutator seems to process all leaf partitions.

-----Original Message-----
From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp]
Sent: Friday, July 13, 2018 1:35 PM
To: Kato, Sho/加藤 翔 <kato-sho@jp.fujitsu.com>; PostgreSQL mailing lists <pgsql-hackers@postgresql.org>
Subject: Re: How to make partitioning scale better for larger numbers of partitions

Kato-san,

On 2018/07/13 11:58, Kato, Sho wrote:
> Hi,
>
> I benchmarked on a RANGE partitioned table with 1.1k leaf partitions and no sub-partitioned tables.

Thanks for sharing the results.

> But, statement latencies on a partitioned table is much slower than on a non-partitioned table.
>
> UPDATE latency is 210 times slower than a non-partitioned table.
> SELECT latency is 36 times slower than a non-partitioned table.
> Surprisingly INSERT latency is almost same.

Yes, INSERT comes out ahead because there is no overhead of partitioning in the planning phase.  As David Rowley
reported on the nearby thread ("Speeding up INSERTs and UPDATEs to partitioned tables"), there is still significant
overhead during its execution, so we're still a fair bit away from the best possible performance.

For SELECT/UPDATE/DELETE, overhead of partitioning in the planning phase is pretty significant and gets worse as the
number of partitions grows.  I had intended to fix that in PG 11, but we could only manage to get part of that work into
PG 11, with significant help from David and others.  So, while PG 11's overhead of partitioning during planning is less
than PG 10's due to improved pruning algorithm, it's still pretty far from ideal, because it isn't just the pruning
algorithm that had overheads.  In fact, PG 11 only removes the pruning overhead for SELECT, so UPDATE/DELETE still carry
the overhead that was in PG 10.  The overheads I mention stem from the fact that for partitioning we still rely on the
old planner code that's used to perform inheritance planning, which requires to lock and open *all* partitions.  We have
so far been able to refactor just enough to use the new code for partition pruning, but there is much refactoring work
left to avoid needlessly locking and opening all partitions.

Thanks,
Amit






Re: How to make partitioning scale better for larger numbers of partitions

From
Amit Langote
Date:
On 2018/07/13 16:29, Kato, Sho wrote:
> I also benchmark PG10.
> Actually, SELECT latency on PG11beta2 + patch1 is faster than PG10.
> 
> SELECT latency with 800 leaf partition
> --------------------------------------
> PG10 5.62 ms
> PG11  3.869 ms  
> 
> But, even PG11, SELECT statement takes 21.102ms on benchmark with 1600 leaf partitions.
> It takes a long time though partition pruning algorithm of PG11 is binary search.

Yeah, the pruning algorithm change evidently removed only a small percentage
of the overhead.

>> The overheads I mention stem from the fact that for partitioning we still rely on the old planner code that's used
to perform inheritance planning, which requires to lock and open *all* partitions.
 
> 
> I debug update statement execution on partitioned table.
> range_table_mutator seems process all leaf partitions.

That's one of the symptoms of it, yes.

With the existing code for UPDATE/DELETE planning of partitioned tables,
the whole Query structure is modified to translate the parent's attribute
numbers to partition attribute numbers and planned freshly *for every
partition*.  range_table_mutator() is invoked sometime during that
translation process and itself loops over all partitions, so I'd think it
is susceptible to being prominently visible in perf profiles.
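
(A quick way to see this from SQL, as a sketch: EXPLAIN with the SUMMARY option reports planning time, and comparing the two statement types on the benchmark table shows the asymmetry, e.g.

  -- pruned to a single partition; planning time stays comparatively small
  EXPLAIN (SUMMARY ON) SELECT abalance FROM test.accounts WHERE aid = 1;

  -- goes through inheritance_planner and is re-planned once per leaf partition,
  -- so planning time grows with the number of partitions
  EXPLAIN (SUMMARY ON) UPDATE test.accounts SET abalance = abalance + 1 WHERE aid = 1;

The absolute numbers depend on the machine, so treat them as illustrative only.)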

Thanks,
Amit



RE: How to make partitioning scale better for larger numbers of partitions

From
"Kato, Sho"
Date:
Tsunakawa-san

>Kato-san, could you try pgbench -M prepared?

I did pgbench -M prepared and perf record.

UPDATE latency in prepared mode is 95% shorter than in simple mode. 
SELECT latency in prepared mode is 54% shorter than in simple mode.
INSERT latency in prepared mode is 8% shorter than in simple mode.

In the perf reports, AllocSetAlloc, hash_search_with_hash_value and hash_any appeared for all SQL statements.

Details are as follows.

pgbench results
---------------

transaction type: update.sql
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
duration: 180 s
number of transactions actually processed: 12295
latency average = 14.641 ms
tps = 68.302806 (including connections establishing)
tps = 68.303430 (excluding connections establishing)
statement latencies in milliseconds:
         0.009  \set aid random(1, 1100 * 1)
         0.004  \set delta random(-5000, 5000)
         0.026  BEGIN;
        14.089  UPDATE test.accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.513  END;

transaction type: select.sql
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
duration: 180 s
number of transactions actually processed: 45145
latency average = 3.996 ms
tps = 250.272094 (including connections establishing)
tps = 250.274404 (excluding connections establishing)
statement latencies in milliseconds:
         0.750  \set aid random(1, 1100 * 1)
         0.714  \set delta random(-5000, 5000)
         0.023  BEGIN;
         2.262  SELECT abalance FROM test.accounts WHERE aid = :aid;
         0.247  END;

transaction type: insert.sql
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
duration: 180 s
number of transactions actually processed: 34894
latency average = 5.158 ms
tps = 193.855216 (including connections establishing)
tps = 193.857020 (excluding connections establishing)
statement latencies in milliseconds:
         1.007  \set aid random(1, 1100 * 1)
         0.802  \set delta random(-5000, 5000)
         0.025  BEGIN;
         2.963  INSERT INTO test.accounts_history (aid, delta, mtime) VALUES (:aid, :delta, CURRENT_TIMESTAMP);
         0.360  END;

Top 10 of perf report
---------------------
UPDATE:
 11.86%  postgres  postgres           [.] list_nth
 10.23%  postgres  postgres           [.] ExecOpenScanRelation
  7.47%  postgres  postgres           [.] list_member_int
  4.78%  postgres  postgres           [.] AllocSetAlloc
  3.61%  postgres  postgres           [.] palloc0
  3.09%  postgres  postgres           [.] hash_search_with_hash_value
  2.33%  postgres  postgres           [.] ResourceArrayAdd
  1.99%  postgres  postgres           [.] hash_any
  1.83%  postgres  postgres           [.] copyObjectImpl
  1.64%  postgres  postgres           [.] SearchCatCache1

SELECT:
 33.60%  postgres  postgres           [.] hash_search_with_hash_value
 13.02%  postgres  postgres           [.] hash_any
  5.30%  postgres  postgres           [.] LockAcquireExtended
  5.16%  postgres  postgres           [.] LockReleaseAll
  4.00%  postgres  postgres           [.] hash_seq_search
  3.84%  postgres  postgres           [.] LWLockRelease
  3.81%  postgres  postgres           [.] AllocSetAlloc
  3.23%  postgres  postgres           [.] LWLockAcquire
  2.55%  postgres  postgres           [.] SetupLockInTable
  1.90%  postgres  postgres           [.] AcquireExecutorLocks

INSERT:
 21.86%  postgres  postgres           [.] hash_search_with_hash_value
  6.36%  postgres  postgres           [.] hash_any
  4.95%  postgres  postgres           [.] AllocSetAlloc
  4.08%  postgres  postgres           [.] LWLockRelease
  3.83%  postgres  postgres           [.] LWLockAcquire
  3.26%  postgres  postgres           [.] SearchCatCache
  3.15%  postgres  postgres           [.] oid_cmp
  2.78%  postgres  postgres           [.] LockReleaseAll
  2.76%  postgres  postgres           [.] pg_qsort
  2.41%  postgres  postgres           [.] SearchCatCache1


-----Original Message-----
From: Tsunakawa, Takayuki/綱川 貴之 
Sent: Friday, July 13, 2018 2:49 PM
To: 'Amit Langote' <Langote_Amit_f8@lab.ntt.co.jp>; Kato, Sho/加藤 翔 <kato-sho@jp.fujitsu.com>; PostgreSQL mailing lists
<pgsql-hackers@postgresql.org>
Subject: RE: How to make partitioning scale better for larger numbers of partitions

From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp]
> For SELECT/UPDATE/DELETE, overhead of partitioning in the planning phase
> is pretty significant and gets worse as the number of partitions grows.
> I
> had intended to fix that in PG 11, but we could only manage to get part
> of
> that work into PG 11, with significant help from David and others.  So,
> while PG 11's overhead of partitioning during planning is less than PG
> 10's due to improved pruning algorithm, it's still pretty far from ideal,
> because it isn't just the pruning algorithm that had overheads.  In fact,
> PG 11 only removes the pruning overhead for SELECT, so UPDATE/DELETE still
> carry the overhead that was in PG 10.

David has submitted multiple patches for PG 12, one of which speeds up pruning of UPDATE/DELETE (I couldn't find it in
the current CF, though.)  What challenges are there for future versions, and which of them are being addressed by
patches in progress for PG 12, and which issues are untouched?
 

> The overheads I mention stem from
> the fact that for partitioning we still rely on the old planner code
> that's used to perform inheritance planning, which requires to lock and
> open *all* partitions.  We have so far been able to refactor just enough
> to use the new code for partition pruning, but there is much refactoring
> work left to avoid needlessly locking and opening all partitions.

I remember the issue of opening and locking all partitions was discussed before.  Does this relate only to the planning
phase?

Kato-san, could you try pgbench -M prepared?


Regards
Takayuki Tsunakawa





Re: How to make partitioning scale better for larger numbers of partitions

From
Ashutosh Bapat
Date:
On Fri, Jul 13, 2018 at 9:23 AM, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:
>>I wondered if you compared to PG10 or to inheritance-partitioning (parent with relkind='r' and either trigger or rule
>>or INSERT/UPDATE directly into child)?
 
>
> Thank you for your reply.
>
> I compared to PG11beta2 with non-partitioned table.
>
> Non-partitioned table has 1100 records in one table.
> Partitioned table has one record on each leaf partitions.
>

I don't think partitioning should be employed this way even for the
sake of comparison. Depending upon the size of each tuple, if 1100 tuples
are inserted into a single table, they will probably occupy a few
hundred pages. In a partitioned table with one tuple per partition
they will occupy 1100 pages at least. There are other space and locking
overheads to maintaining 1100 tables. I think the right way to compare is
to have really large data that really requires 1100 partitions and
then compare performance by putting that data in 1100 partitions and
in an unpartitioned table. Even with that kind of data, you will see
some difference in performance, but that won't be as dramatic as you
report.

I might be missing something though.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


RE: How to make partitioning scale better for larger numbers of partitions

From
"Tsunakawa, Takayuki"
Date:
From: Kato, Sho [mailto:kato-sho@jp.fujitsu.com]
> I did pgbench -M prepared and perf record.
> 
> UPDATE latency in prepared mode is 95% shorter than in simple mode.
> SELECT latency in prepared mode is 54% shorter than in simple mode.
> INSERT latency in prepared mode is 8% shorter than in simple mode.

Thanks.  And what does the comparison look like between the unpartitioned case and various partition counts?  What's
the performance characteristics in terms of the latency and partition count?   I thought that's what you tried to reveal
first?

Regards
Takayuki Tsunakawa




Re: How to make partitioning scale better for larger numbers of partitions

From
Amit Langote
Date:
On 2018/07/13 22:10, Ashutosh Bapat wrote:
> On Fri, Jul 13, 2018 at 9:23 AM, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:
>>> I wondered if you compared to PG10 or to inheritance-partitioning (parent with relkind='r' and either trigger or
>>> rule or INSERT/UPDATE directly into child)?
 
>>
>> Thank you for your reply.
>>
>> I compared to PG11beta2 with non-partitioned table.
>>
>> Non-partitioned table has 1100 records in one table.
>> Partitioned table has one record on each leaf partitions.
>>
> 
> I don't think partitioning should be employed this way even for the
> sake of comparison. Depending upon the size of each tuple, 1100 tuples
> are inserted into a single table, they will probably occupy few
> hundred pages. In a partitioned table with one tuple per partition
> they will occupy 1100 pages at least. There is other space, locking
> overheads to maintain 1100 tables. I think the right way to compare is
> to have really large data that really requires 1100 partitions and
> then compare performance by putting that data in 1100 partitions and
> in an unpartitioned table. Even with that kind of data, you will see
> some difference in performance, but that won't be as dramatic as you
> report.
> 
> I might be missing something though.

Perhaps, Kato-san only intended to report that the time that planner
spends for a partitioned table with 1100 partitions is just too high
compared to the time it spends on a non-partitioned table.  It is and has
been clear that that's because the planning time explodes as the number of
partitions increases.

If there's lots of data in it, then the result will look completely
different as you say, because scanning a single partition (of the 1100
total) will spend less time than scanning a non-partitioned table
containing 1100 partitions worth of data.  But the planning time would
still be more for the partitioned table, which seems to be the point of
this benchmark.

Thanks,
Amit



RE: How to make partitioning scale better for larger numbers of partitions

From
"Kato, Sho"
Date:
On 2018/07/17 10:49, Amit Langote wrote:
>Perhaps, Kato-san only intended to report that the time that planner spends for a partitioned table with 1100
partitions is just too high compared to the time it spends on a non-partitioned table.
 

Yes, it is included for the purposes of this comparison.

The purpose of this comparison is to find where the partitioning bottleneck is.
Using the bottleneck as a hint of improvement, I'd like to bring the performance of partitioned table closer to the
performance of unpartitioned table as much as possible.
 

-----Original Message-----
From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp] 
Sent: Tuesday, July 17, 2018 10:49 AM
To: Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>; Kato, Sho/加藤 翔 <kato-sho@jp.fujitsu.com>
Cc: Justin Pryzby <pryzby@telsasoft.com>; PostgreSQL mailing lists <pgsql-hackers@postgresql.org>
Subject: Re: How to make partitioning scale better for larger numbers of partitions

On 2018/07/13 22:10, Ashutosh Bapat wrote:
> On Fri, Jul 13, 2018 at 9:23 AM, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:
>>> I wondered if you compared to PG10 or to inheritance-partitioning (parent with relkind='r' and either trigger or
>>> rule or INSERT/UPDATE directly into child)?
 
>>
>> Thank you for your reply.
>>
>> I compared to PG11beta2 with non-partitioned table.
>>
>> Non-partitioned table has 1100 records in one table.
>> Partitioned table has one record on each leaf partitions.
>>
> 
> I don't think partitioning should be employed this way even for the 
> sake of comparison. Depending upon the size of each tuple, 1100 tuples 
> are inserted into a single table, they will probably occupy few 
> hundred pages. In a partitioned table with one tuple per partition 
> they will occupy 1100 pages at least. There is other space, locking 
> overheads to maintain 1100 tables. I think the right way to compare is 
> to have really large data that really requires 1100 partitions and
> then compare performance by putting that data in 1100 partitions and 
> in an unpartitioned table. Even with that kind of data, you will see 
> some difference in performance, but that won't be as dramatic as you 
> report.
> 
> I might be missing something though.

Perhaps, Kato-san only intended to report that the time that planner spends for a partitioned table with 1100
partitions is just too high compared to the time it spends on a non-partitioned table.  It is and has been clear that
that's because the planning time explodes as the number of partitions increases.
 

If there's lots of data in it, then the result will look completely different as you say, because scanning a single
partition (of the 1100 total) will spend less time than scanning a non-partitioned table containing 1100 partitions
worth of data.  But the planning time would still be more for the partitioned table, which seems to be the point of
this benchmark.
 

Thanks,
Amit




Re: How to make partitioning scale better for larger numbers of partitions

From
Ashutosh Bapat
Date:
On Tue, Jul 17, 2018 at 8:31 AM, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:
> On 2018/07/17 10:49, Amit Langote wrote:
>>Perhaps, Kato-san only intended to report that the time that planner spends for a partitioned table with 1100
partitions is just too high compared to the time it spends on a non-partitioned table.
 
>
> yes, It is included for the purposes of this comparison.
>
> The purpose of this comparison is to find where the partitioning bottleneck is.
> Using the bottleneck as a hint of improvement, I'd like to bring the performance of partitioned table closer to the
performance of unpartitioned table as much as possible.
 
>

That's a good thing, but may not turn out to be realistic. We should
compare performance where partitioning matters and try to improve in
those contexts. Else we might improve performance in scenarios which
are never used.

In this case, even if we improve the planning time by 100%, it hardly
matters since planning time is negligible compared to the execution
time because of huge data where partitioning is useful.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: How to make partitioning scale better for larger numbers of partitions

From
Amit Langote
Date:
On 2018/07/17 12:14, Ashutosh Bapat wrote:
> On Tue, Jul 17, 2018 at 8:31 AM, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:
>> On 2018/07/17 10:49, Amit Langote wrote:
>>> Perhaps, Kato-san only intended to report that the time that planner spends for a partitioned table with 1100
partitions is just too high compared to the time it spends on a non-partitioned table.
 
>>
>> yes, It is included for the purposes of this comparison.
>>
>> The purpose of this comparison is to find where the partitioning bottleneck is.
>> Using the bottleneck as a hint of improvement, I'd like to bring the performance of partitioned table closer to the
performance of unpartitioned table as much as possible.
 
>>
> 
> That's a good thing, but may not turn out to be realistic. We should
> compare performance where partitioning matters and try to improve in
> those contexts. Else we might improve performance in scenarios which
> are never used.
> 
> In this case, even if we improve the planning time by 100%, it hardly
> matters since planning time is negligible compared to the execution
> time because of huge data where partitioning is useful.

While I agree that it's a good idea to tell users to use partitioning only
if the overhead of having the partitioning in the first place is bearable,
especially the planner overhead, this benchmark shows us that even for
what I assume might be fairly commonly occurring queries (select .. from /
update .. / delete from partitioned_table where partkey = ?), the planner
spends way too many redundant cycles.  Some amount of that overhead will
always remain and planning with partitioning will always be a bit slower
than without partitioning, but it's *too* slow right now for a non-trivial
number of partitions.

Thanks,
Amit



RE: How to make partitioning scale better for larger numbers of partitions

From
"Kato, Sho"
Date:
On 2018/07/16 13:16, Tsunakawa-san wrote:
>Thanks.  And what does the comparison look like between the unpartitioned case and various partition counts?  What's
the performance characteristics in terms of the latency and partition count?   I thought that's what you tried to reveal
first?

On an unpartitioned table, the latency of SELECT/UPDATE statements is close to O(n), where n is the number of records.
The latency of INSERT statements is close to O(1).

On a partitioned table, up to 400 partitions, the latency of SELECT/INSERT statements is close to O(log n), where n is
the number of partitions.
Between 400 and 6400 partitions, latency is close to O(n).
Up to 400 partitions, the latency of UPDATE statements is close to O(n).
Between 400 and 6400 partitions, the latency of UPDATE statements seems to be O(n^2).

Details are as follows.

unpartitioned table result (prepared mode)
------------------------------------------

 scale | latency_avg |   tps_ex    | update_latency | select_latency | insert_latency 
-------+-------------+-------------+----------------+----------------+----------------
   100 |        0.24 | 4174.395738 |          0.056 |          0.051 |           0.04
   200 |       0.258 | 3873.099014 |          0.065 |          0.059 |           0.04
   400 |        0.29 | 3453.171112 |          0.081 |          0.072 |          0.041
   800 |       0.355 | 2814.936942 |          0.112 |          0.105 |          0.041
  1600 |       0.493 | 2027.702689 |           0.18 |          0.174 |          0.042
  3200 |       0.761 | 1313.532458 |          0.314 |          0.307 |          0.043
  6400 |       1.214 |  824.001431 |           0.54 |          0.531 |          0.043


partitioned table result (prepared mode)
----------------------------------------

 num_part | latency_avg |   tps_ex    | update_latency | select_latency | insert_latency 
----------+-------------+-------------+----------------+----------------+----------------
      100 |       0.937 | 1067.473258 |          0.557 |          0.087 |          0.123
      200 |        1.65 |  606.244552 |          1.115 |          0.121 |          0.188
      400 |       3.295 |  303.491681 |          2.446 |           0.19 |          0.322
      800 |       8.102 |  123.422456 |          6.553 |          0.337 |          0.637
     1600 |      36.996 |   27.030027 |         31.528 |          1.621 |          2.455
     3200 |     147.998 |    6.756922 |        136.136 |           4.21 |           4.94
     6400 |     666.947 |    1.499383 |        640.879 |          7.631 |          9.642

regards,
-----Original Message-----
From: Tsunakawa, Takayuki [mailto:tsunakawa.takay@jp.fujitsu.com] 
Sent: Monday, July 16, 2018 1:16 PM
To: Kato, Sho/加藤 翔 <kato-sho@jp.fujitsu.com>
Cc: 'Amit Langote' <Langote_Amit_f8@lab.ntt.co.jp>; PostgreSQL mailing lists <pgsql-hackers@postgresql.org>
Subject: RE: How to make partitioning scale better for larger numbers of partitions

From: Kato, Sho [mailto:kato-sho@jp.fujitsu.com]
> I did pgbench -M prepared and perf record.
> 
> UPDATE latency in prepared mode is 95% shorter than in simple mode.
> SELECT latency in prepared mode is 54% shorter than in simple mode.
> INSERT latency in prepared mode is 8% shorter than in simple mode.

Thanks.  And what does the comparison look like between the unpartitioned case and various partition counts?  What's
the performance characteristics in terms of the latency and partition count?   I thought that's what you tried to reveal
first?

Regards
Takayuki Tsunakawa