Thread: [HACKERS] Declarative partitioning optimization for large amount of partitions

[HACKERS] Declarative partitioning optimization for large amount of partitions

From: Aleksander Alekseev
Hello.

I decided to figure out whether the current implementation of declarative
partitioning has any bottlenecks when there are a lot of partitions. Here
is what I did [1].

```
-- init schema

\timing on

CREATE TABLE part_test (pk int not null, k int, v varchar(128)) PARTITION BY RANGE(pk);

do $$
declare
  i integer;
begin
  for i in 1 .. 10000 loop
    raise notice 'i = %', i;
    execute ('CREATE TABLE part_test_' || i ||
             ' PARTITION OF part_test FOR VALUES FROM (' ||
             (1 + (i-1)*1000) || ') to (' || ((i * 1000) + 1) || ');');
  end loop;
end $$;

-- fill tables with some data

do $$
declare
  i integer;
begin
  for i in 1 .. 100*1000 loop
    raise notice 'i = %', i;
    execute ('insert into part_test values (ceil(random()*(10000-1)*1000), ceil(random()*10000*1000), '''' || ceil(random()*10000*1000));');
  end loop;
end $$;
```

Then:

```
# 2580 is some pk that exists
echo 'select * from part_test where pk = 2580;' > t.sql
pgbench -j 7 -c 7 -f t.sql -P 1 -T 300 eax
```

`perf top` showed two bottlenecks [2]. A stack trace for the first one
looks like this [3]:

```
0x00000000007a42e2 in get_tabstat_entry (rel_id=25696, isshared=0 '\000') at pgstat.c:1689
1689                if (entry->t_id == rel_id)
#0  0x00000000007a42e2 in get_tabstat_entry (rel_id=25696, isshared=0 '\000') at pgstat.c:1689
#1  0x00000000007a4275 in pgstat_initstats (rel=0x7f4af3fd41f8) at pgstat.c:1666
#2  0x00000000004c7090 in relation_open (relationId=25696, lockmode=0) at heapam.c:1137
#3  0x00000000004c72c9 in heap_open (relationId=25696, lockmode=0) at heapam.c:1291
(skipped)
```

And here is a stack trace for the second bottleneck [4]:

```
0x0000000000584fb1 in find_all_inheritors (parentrelId=16393, lockmode=1, numparents=0x0) at pg_inherits.c:199
199                forboth(lo, rels_list, li, rel_numparents)
#0  0x0000000000584fb1 in find_all_inheritors (parentrelId=16393, lockmode=1, numparents=0x0) at pg_inherits.c:199
#1  0x000000000077fc9f in expand_inherited_rtentry (root=0x1badcb8, rte=0x1b630b8, rti=1) at prepunion.c:1408
#2  0x000000000077fb67 in expand_inherited_tables (root=0x1badcb8) at prepunion.c:1335
#3  0x0000000000767526 in subquery_planner (glob=0x1b63cc0, parse=0x1b62fa0, parent_root=0x0, hasRecursion=0 '\000', tuple_fraction=0) at planner.c:568
(skipped)
```

The first one could be easily fixed by introducing a hash table
(rel_id -> pgStatList entry). Perhaps hash table should be used only
after some threshold. Unless there are any objections I will send a
corresponding patch shortly.
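
Roughly, the lookup-table entry would look like this (a sketch of what I
have in mind; the comments are illustrative):

```
/* pgStatTabHash entry: maps a relation OID to its TabStatusArray slot */
typedef struct TabStatHashEntry
{
    Oid                  t_id;       /* hash key: relation OID */
    PgStat_TableStatus  *tsa_entry;  /* pointer into TabStatusArray */
} TabStatHashEntry;
```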

I haven't explored the second bottleneck closely yet, but at first glance
it doesn't look much more complicated.

Please don't hesitate to share your thoughts regarding this matter.

[1] http://afiskon.ru/s/e3/5f47af9102_benchmark.txt
[2] http://afiskon.ru/s/00/2008c4ae66_temp.png
[3] http://afiskon.ru/s/23/650f0afc89_stack.txt
[4] http://afiskon.ru/s/03/a7e685a4db_stack2.txt

--
Best regards,
Aleksander Alekseev

Hi,

On 2017/02/28 23:25, Aleksander Alekseev wrote:
> Hello.
> 
> I decided to figure out whether the current implementation of declarative
> partitioning has any bottlenecks when there are a lot of partitions. Here
> is what I did [1].

Thanks for sharing.

> Then:
> 
> ```
> # 2580 is some pk that exists
> echo 'select * from part_test where pk = 2580;' > t.sql
> pgbench -j 7 -c 7 -f t.sql -P 1 -T 300 eax
> ```
> 
> `perf top` showed two bottlenecks [2]. A stack trace for the first one
> looks like this [3]:
> 
> ```
> 0x00000000007a42e2 in get_tabstat_entry (rel_id=25696, isshared=0 '\000') at pgstat.c:1689
> 1689                if (entry->t_id == rel_id)
> #0  0x00000000007a42e2 in get_tabstat_entry (rel_id=25696, isshared=0 '\000') at pgstat.c:1689
> #1  0x00000000007a4275 in pgstat_initstats (rel=0x7f4af3fd41f8) at pgstat.c:1666
> #2  0x00000000004c7090 in relation_open (relationId=25696, lockmode=0) at heapam.c:1137
> #3  0x00000000004c72c9 in heap_open (relationId=25696, lockmode=0) at heapam.c:1291
> (skipped)
> ```
> 
> And here is a stack trace for the second bottleneck [4]:
> 
> ```
> 0x0000000000584fb1 in find_all_inheritors (parentrelId=16393, lockmode=1, numparents=0x0) at pg_inherits.c:199
> 199                forboth(lo, rels_list, li, rel_numparents)
> #0  0x0000000000584fb1 in find_all_inheritors (parentrelId=16393, lockmode=1, numparents=0x0) at pg_inherits.c:199
> #1  0x000000000077fc9f in expand_inherited_rtentry (root=0x1badcb8, rte=0x1b630b8, rti=1) at prepunion.c:1408
> #2  0x000000000077fb67 in expand_inherited_tables (root=0x1badcb8) at prepunion.c:1335
> #3  0x0000000000767526 in subquery_planner (glob=0x1b63cc0, parse=0x1b62fa0, parent_root=0x0, hasRecursion=0 '\000', tuple_fraction=0) at planner.c:568
> (skipped)
> ```
> 
> The first one could easily be fixed by introducing a hash table
> (rel_id -> pgStatList entry). Perhaps the hash table should be used only
> after some threshold. Unless there are any objections I will send a
> corresponding patch shortly.

I have never thought about this one really.

> I haven't explored the second bottleneck closely yet, but at first glance
> it doesn't look much more complicated.

I don't know which way you're thinking of fixing this, but a planner patch
to implement faster partition-pruning would take care of this, I
think.  As you may know, even declarative partitioned tables currently
depend on constraint exclusion for partition-pruning, and the planner's
current approach to handling inheritance requires opening all the child
tables (partitions), whereas the new approach hopefully shouldn't need to
do that.  I am not sure whether looking for a more localized fix for this
would be worthwhile, although I may be wrong.

Thanks,
Amit





Re: [HACKERS] Declarative partitioning optimization for large amount of partitions

From: Aleksander Alekseev
Hello.

OK, here is a patch.

Benchmark, before:

```
number of transactions actually processed: 1823
latency average = 1153.495 ms
latency stddev = 154.366 ms
tps = 6.061104 (including connections establishing)
tps = 6.061211 (excluding connections establishing)
```

Benchmark, after:

```
number of transactions actually processed: 2462
latency average = 853.862 ms
latency stddev = 112.270 ms
tps = 8.191861 (including connections establishing)
tps = 8.192028 (excluding connections establishing)
```

+35% TPS, just as expected. Feel free to run your own benchmarks on
different datasets and workloads. `perf top` shows that the first bottleneck
is completely eliminated. I did nothing about the second bottleneck
since, as Amit mentioned, partition-pruning should solve it anyway and
apparently any micro-optimizations aren't worth the effort.

Also, I checked test coverage using lcov to make sure that this code is
covered by tests. The exact script I'm using can be found here [1].

[1] https://github.com/afiskon/pgscripts/blob/master/code-coverage.sh

On Wed, Mar 01, 2017 at 10:36:24AM +0900, Amit Langote wrote:
> Hi,
>
> On 2017/02/28 23:25, Aleksander Alekseev wrote:
> > Hello.
> >
> > I decided to figure out whether the current implementation of declarative
> > partitioning has any bottlenecks when there are a lot of partitions. Here
> > is what I did [1].
>
> Thanks for sharing.
>
> > Then:
> >
> > ```
> > # 2580 is some pk that exists
> > echo 'select * from part_test where pk = 2580;' > t.sql
> > pgbench -j 7 -c 7 -f t.sql -P 1 -T 300 eax
> > ```
> >
> > `perf top` showed two bottlenecks [2]. A stack trace for the first one
> > looks like this [3]:
> >
> > ```
> > 0x00000000007a42e2 in get_tabstat_entry (rel_id=25696, isshared=0 '\000') at pgstat.c:1689
> > 1689                if (entry->t_id == rel_id)
> > #0  0x00000000007a42e2 in get_tabstat_entry (rel_id=25696, isshared=0 '\000') at pgstat.c:1689
> > #1  0x00000000007a4275 in pgstat_initstats (rel=0x7f4af3fd41f8) at pgstat.c:1666
> > #2  0x00000000004c7090 in relation_open (relationId=25696, lockmode=0) at heapam.c:1137
> > #3  0x00000000004c72c9 in heap_open (relationId=25696, lockmode=0) at heapam.c:1291
> > (skipped)
> > ```
> >
> > And here is a stack trace for the second bottleneck [4]:
> >
> > ```
> > 0x0000000000584fb1 in find_all_inheritors (parentrelId=16393, lockmode=1, numparents=0x0) at pg_inherits.c:199
> > 199                forboth(lo, rels_list, li, rel_numparents)
> > #0  0x0000000000584fb1 in find_all_inheritors (parentrelId=16393, lockmode=1, numparents=0x0) at pg_inherits.c:199
> > #1  0x000000000077fc9f in expand_inherited_rtentry (root=0x1badcb8, rte=0x1b630b8, rti=1) at prepunion.c:1408
> > #2  0x000000000077fb67 in expand_inherited_tables (root=0x1badcb8) at prepunion.c:1335
> > #3  0x0000000000767526 in subquery_planner (glob=0x1b63cc0, parse=0x1b62fa0, parent_root=0x0, hasRecursion=0 '\000', tuple_fraction=0) at planner.c:568
> > (skipped)
> > ```
> >
> > The first one could easily be fixed by introducing a hash table
> > (rel_id -> pgStatList entry). Perhaps the hash table should be used only
> > after some threshold. Unless there are any objections I will send a
> > corresponding patch shortly.
>
> I have never thought about this one really.
>
> > I haven't explored the second bottleneck closely yet, but at first glance
> > it doesn't look much more complicated.
>
> I don't know which way you're thinking of fixing this, but a planner patch
> to implement faster partition-pruning would take care of this, I
> think.  As you may know, even declarative partitioned tables currently
> depend on constraint exclusion for partition-pruning, and the planner's
> current approach to handling inheritance requires opening all the child
> tables (partitions), whereas the new approach hopefully shouldn't need to
> do that.  I am not sure whether looking for a more localized fix for this
> would be worthwhile, although I may be wrong.
>
> Thanks,
> Amit
>
>

--
Best regards,
Aleksander Alekseev

Attachment
Hi,

This issue has bothered me in non-partitioned use-cases recently, so
thanks for taking it up.


On 2017-03-06 18:22:17 +0300, Aleksander Alekseev wrote:
> diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
> index 2fb9a8bf58..fa906e7930 100644
> --- a/src/backend/postmaster/pgstat.c
> +++ b/src/backend/postmaster/pgstat.c
> @@ -160,6 +160,16 @@ typedef struct TabStatusArray
>  
>  static TabStatusArray *pgStatTabList = NULL;

Why are we keeping that list / the "batch" allocation pattern? It
doesn't actually seem useful to me after these changes.  Given that we
don't shrink hash-tables, we could just use the hash-table memory for
this, no?

I think a separate list that only contains things that are "active" in
the current transaction makes sense, because the current logic requires
iterating over a potentially very large array at transaction commit.
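
To be concrete, something like this sketch (all names here are made up,
not actual code):

```
/* hypothetical: remember only the entries touched in this transaction */
static List *pgStatActiveEntries = NIL;

/* when an entry is modified for the first time: */
pgStatActiveEntries = lappend(pgStatActiveEntries, entry);

/* at commit, visit only the touched entries instead of the whole array */
ListCell   *lc;

foreach(lc, pgStatActiveEntries)
    pgstat_report_one((PgStat_TableStatus *) lfirst(lc)); /* hypothetical */
pgStatActiveEntries = NIL;
```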


> +/* pgStatTabHash entry */
> +typedef struct TabStatHashEntry
> +{
> +    Oid t_id;
> +    PgStat_TableStatus* tsa_entry;
> +} TabStatHashEntry;
> +
> +/* Hash table for faster t_id -> tsa_entry lookup */
> +static HTAB *pgStatTabHash = NULL;
> +

'faster ... lookup' doesn't strike me as a useful comment, because it's
only meaningful when comparing the new code to the current code.



>  /*
>   * Backends store per-function info that's waiting to be sent to the collector
>   * in this hash table (indexed by function OID).
> @@ -815,7 +825,13 @@ pgstat_report_stat(bool force)
>                  pgstat_send_tabstat(this_msg);
>                  this_msg->m_nentries = 0;
>              }
> +
> +            /*
> +             * Entry will be freed soon so we need to remove it from the lookup table.
> +             */
> +            hash_search(pgStatTabHash, &entry->t_id, HASH_REMOVE, NULL);
>          }

It's not really freed...


>  static PgStat_TableStatus *
>  get_tabstat_entry(Oid rel_id, bool isshared)
>  {
> +    TabStatHashEntry* hash_entry;
>      PgStat_TableStatus *entry;
>      TabStatusArray *tsa;
> -    TabStatusArray *prev_tsa;
> -    int            i;
> +
> +    /* Try to find an entry */
> +    entry = find_tabstat_entry(rel_id);
> +    if(entry != NULL)
> +        return entry;

Arguably it'd be better to HASH_ENTER directly here, instead of doing
two lookups.
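
Something along these lines, I assume (a sketch; make_new_tabstat_slot()
is a made-up name for the allocation logic):

```
bool        found;
TabStatHashEntry *hash_entry;

/* one probe: HASH_ENTER returns the existing entry or creates a new one */
hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
if (!found)
    hash_entry->tsa_entry = make_new_tabstat_slot(rel_id, isshared);

return hash_entry->tsa_entry;
```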


>      /*
> -     * Search the already-used tabstat slots for this relation.
> +     * Entry doesn't exist - creating one.
> +     * First make sure there is a free space in a last element of pgStatTabList.
>       */
> -    prev_tsa = NULL;
> -    for (tsa = pgStatTabList; tsa != NULL; prev_tsa = tsa, tsa = tsa->tsa_next)
> +    if (!pgStatTabList)
>      {
> -        for (i = 0; i < tsa->tsa_used; i++)
> -        {
> -            entry = &tsa->tsa_entries[i];
> -            if (entry->t_id == rel_id)
> -                return entry;
> -        }
> +        HASHCTL     ctl;
>  
> -        if (tsa->tsa_used < TABSTAT_QUANTUM)
> +        Assert(!pgStatTabHash);
> +
> +        memset(&ctl, 0, sizeof(ctl));
> +        ctl.keysize = sizeof(Oid);
> +        ctl.entrysize = sizeof(TabStatHashEntry);
> +        ctl.hcxt = TopMemoryContext;
> +
> +        pgStatTabHash = hash_create("pgstat t_id to tsa_entry lookup table",
> +            TABSTAT_QUANTUM, &ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

Think it'd be better to move the hash creation into its own function.


- Andres



Hi Aleksander,

On 2017/03/07 0:22, Aleksander Alekseev wrote:
> Hello.
> 
> OK, here is a patch.
> 
> Benchmark, before:
> 
> ```
> number of transactions actually processed: 1823
> latency average = 1153.495 ms
> latency stddev = 154.366 ms
> tps = 6.061104 (including connections establishing)
> tps = 6.061211 (excluding connections establishing)
> ```
> 
> Benchmark, after:
> 
> ```
> number of transactions actually processed: 2462
> latency average = 853.862 ms
> latency stddev = 112.270 ms
> tps = 8.191861 (including connections establishing)
> tps = 8.192028 (excluding connections establishing)
> ```
> 
> +35% TPS, just as expected. Feel free to run your own benchmarks on
> different datasets and workloads. `perf top` shows that the first bottleneck
> is completely eliminated.

That seems like a good gain.

> I did nothing about the second bottleneck
> since, as Amit mentioned, partition-pruning should solve it anyway and
> apparently any micro-optimizations aren't worth the effort.

Sorry, I didn't mean to dissuade you from trying those
micro-optimizations.  If general inheritance cases can benefit from them
(and partitioned tables will use the same code path until we have a
different method), I think we should try them.

Thanks,
Amit





Re: [HACKERS] Declarative partitioning optimization for large amount of partitions

From: Aleksander Alekseev
Hi Andres,

Thanks a lot for the review!

> Why are we keeping that list / the "batch" allocation pattern? It
> doesn't actually seem useful to me after these changes.  Given that we
> don't shrink hash-tables, we could just use the hash-table memory for
> this, no?

I don't think we can do that. According to the comments:

```
 * NOTE: once allocated, TabStatusArray structures are never moved or deleted
 [...]
 * This allows relcache pgstat_info pointers to be treated as long-lived data,
 * avoiding repeated searches in pgstat_initstats() when a relation is
 * repeatedly opened during a transaction.
```

It is my understanding that dynahash can't give the same guarantees.
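
For context, pgstat_initstats() caches that pointer roughly like this
(paraphrasing the existing code in pgstat.c):

```
/*
 * If we already set up this relation in the current transaction, the
 * cached pointer is still valid and we are done.  A dynahash entry
 * could be recycled after HASH_REMOVE, which would leave this cached
 * pointer dangling.
 */
if (rel->pgstat_info != NULL &&
    rel->pgstat_info->t_id == rel_id)
    return;

/* else find or make the PgStat_TableStatus entry, and update the link */
rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
```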

> 'faster ... lookup' doesn't strike me as a useful comment, because it's
> only meaningful when comparing the new code to the current code.

Agreed, fixed to 'O(1) ... lookup'.

> It's not really freed...

Right, it's actually zeroed. Fixed.

> Arguably it'd be better to HASH_ENTER directly here, instead of doing
> two lookups.

Good point. Fixed.

> Think it'd be better to move the hash creation into its own function.

Agreed, fixed.

On Mon, Mar 06, 2017 at 12:01:50PM -0800, Andres Freund wrote:
> Hi,
>
> This issue has bothered me in non-partitioned use-cases recently, so
> thanks for taking it up.
>
>
> On 2017-03-06 18:22:17 +0300, Aleksander Alekseev wrote:
> > diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
> > index 2fb9a8bf58..fa906e7930 100644
> > --- a/src/backend/postmaster/pgstat.c
> > +++ b/src/backend/postmaster/pgstat.c
> > @@ -160,6 +160,16 @@ typedef struct TabStatusArray
> >
> >  static TabStatusArray *pgStatTabList = NULL;
>
> Why are we keeping that list / the "batch" allocation pattern? It
> doesn't actually seem useful to me after these changes.  Given that we
> don't shrink hash-tables, we could just use the hash-table memory for
> this, no?
>
> I think a separate list that only contains things that are "active" in
> the current transaction makes sense, because the current logic requires
> iterating over a potentially very large array at transaction commit.
>
>
> > +/* pgStatTabHash entry */
> > +typedef struct TabStatHashEntry
> > +{
> > +    Oid t_id;
> > +    PgStat_TableStatus* tsa_entry;
> > +} TabStatHashEntry;
> > +
> > +/* Hash table for faster t_id -> tsa_entry lookup */
> > +static HTAB *pgStatTabHash = NULL;
> > +
>
> 'faster ... lookup' doesn't strike me as a useful comment, because it's
> only meaningful when comparing the new code to the current code.
>
>
>
> >  /*
> >   * Backends store per-function info that's waiting to be sent to the collector
> >   * in this hash table (indexed by function OID).
> > @@ -815,7 +825,13 @@ pgstat_report_stat(bool force)
> >                  pgstat_send_tabstat(this_msg);
> >                  this_msg->m_nentries = 0;
> >              }
> > +
> > +            /*
> > +             * Entry will be freed soon so we need to remove it from the lookup table.
> > +             */
> > +            hash_search(pgStatTabHash, &entry->t_id, HASH_REMOVE, NULL);
> >          }
>
> It's not really freed...
>
>
> >  static PgStat_TableStatus *
> >  get_tabstat_entry(Oid rel_id, bool isshared)
> >  {
> > +    TabStatHashEntry* hash_entry;
> >      PgStat_TableStatus *entry;
> >      TabStatusArray *tsa;
> > -    TabStatusArray *prev_tsa;
> > -    int            i;
> > +
> > +    /* Try to find an entry */
> > +    entry = find_tabstat_entry(rel_id);
> > +    if(entry != NULL)
> > +        return entry;
>
> Arguably it'd be better to HASH_ENTER directly here, instead of doing
> two lookups.
>
>
> >      /*
> > -     * Search the already-used tabstat slots for this relation.
> > +     * Entry doesn't exist - creating one.
> > +     * First make sure there is a free space in a last element of pgStatTabList.
> >       */
> > -    prev_tsa = NULL;
> > -    for (tsa = pgStatTabList; tsa != NULL; prev_tsa = tsa, tsa = tsa->tsa_next)
> > +    if (!pgStatTabList)
> >      {
> > -        for (i = 0; i < tsa->tsa_used; i++)
> > -        {
> > -            entry = &tsa->tsa_entries[i];
> > -            if (entry->t_id == rel_id)
> > -                return entry;
> > -        }
> > +        HASHCTL     ctl;
> >
> > -        if (tsa->tsa_used < TABSTAT_QUANTUM)
> > +        Assert(!pgStatTabHash);
> > +
> > +        memset(&ctl, 0, sizeof(ctl));
> > +        ctl.keysize = sizeof(Oid);
> > +        ctl.entrysize = sizeof(TabStatHashEntry);
> > +        ctl.hcxt = TopMemoryContext;
> > +
> > +        pgStatTabHash = hash_create("pgstat t_id to tsa_entry lookup table",
> > +            TABSTAT_QUANTUM, &ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
>
> Think it'd be better to move the hash creation into its own function.
>
>
> - Andres

--
Best regards,
Aleksander Alekseev

Attachment

Re: [HACKERS] Declarative partitioning optimization for large amount of partitions

From: Aleksander Alekseev
Hi Amit,

> Sorry, I didn't mean to dissuade you from trying those
> micro-optimizations.  If general inheritance cases can benefit from them
> (and partitioned tables will use the same code path until we have a
> different method), I think we should try them.

OK, I'll see what could be done here as well then.

On Tue, Mar 07, 2017 at 10:55:12AM +0900, Amit Langote wrote:
> Hi Aleksander,
>
> On 2017/03/07 0:22, Aleksander Alekseev wrote:
> > Hello.
> >
> > OK, here is a patch.
> >
> > Benchmark, before:
> >
> > ```
> > number of transactions actually processed: 1823
> > latency average = 1153.495 ms
> > latency stddev = 154.366 ms
> > tps = 6.061104 (including connections establishing)
> > tps = 6.061211 (excluding connections establishing)
> > ```
> >
> > Benchmark, after:
> >
> > ```
> > number of transactions actually processed: 2462
> > latency average = 853.862 ms
> > latency stddev = 112.270 ms
> > tps = 8.191861 (including connections establishing)
> > tps = 8.192028 (excluding connections establishing)
> > ```
> >
> > +35% TPS, just as expected. Feel free to run your own benchmarks on
> > different datasets and workloads. `perf top` shows that the first bottleneck
> > is completely eliminated.
>
> That seems like a good gain.
>
> > I did nothing about the second bottleneck
> > since, as Amit mentioned, partition-pruning should solve it anyway and
> > apparently any micro-optimizations aren't worth the effort.
>
> Sorry, I didn't mean to dissuade you from trying those
> micro-optimizations.  If general inheritance cases can benefit from them
> (and partitioned tables will use the same code path until we have a
> different method), I think we should try them.
>
> Thanks,
> Amit
>
>

--
Best regards,
Aleksander Alekseev

Hi Aleksander,

noticed this in your patch:

> * Add a corespinding entry to pgStatTabHash.

"corresponding"

Also a question: Some one-line comments are

 /* Comment. */

while others are

 /*
  * Comment.
  */

Why the difference?

Hope this helps,

Tels




Re: [HACKERS] Declarative partitioning optimization for large amount of partitions

From: Aleksander Alekseev
Hi Tels,

Thanks a lot for the review!

> "corresponding"

Fixed.

> Also a question: Some one-line comments are
>
>  /* Comment. */
>
> while others are
>
>  /*
>   * Comment.
>   */
>
> Why the difference?

I'm trying to follow the code style used in the code I'm modifying. In this
case I got the impression that the second style of one-line comments is used
more often and I tried to follow it, but apparently I didn't succeed :)
This is fixed as well.

On Thu, Mar 09, 2017 at 06:40:39PM -0500, Tels wrote:
> Hi Aleksander,
>
> noticed this in your patch:
>
> > * Add a corespinding entry to pgStatTabHash.
>
> "corresponding"
>
> Also a question: Some one-line comments are
>
>  /* Comment. */
>
> while others are
>
>  /*
>   * Comment.
>   */
>
> Why the difference?
>
> Hope this helps,
>
> Tels

--
Best regards,
Aleksander Alekseev

Attachment
On 3/10/17 7:16 AM, Aleksander Alekseev wrote:
>
> Thanks a lot for the review!

Anyone want to take a crack at reviewing this new version?

Thanks,
-- 
-David
david@pgmasters.net



[HACKERS] Re: Declarative partitioning optimization for large amount of partitions

From: Anastasia Lubennikova
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            not tested

The patch looks good to me.
It applies cleanly, passes all the tests, and eliminates the bottleneck described in the first message.
And as I can see from the thread, there were no objections from others,
except a few minor comments about code style, which are fixed in the last version of the patch.
So I mark it "Ready for committer".

The new status of this patch is: Ready for Committer

Re: Re: Declarative partitioning optimization for large amount of partitions

From: Aleksander Alekseev
Hi Anastasia,

Thanks a lot for the review!

As was mentioned above there is also a bottleneck in the find_all_inheritors
procedure. It turned out the problem there is similar and could easily be
fixed as well. The corresponding patch is attached to this message. To keep
things in order I'm attaching the previous patch as well.
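
The idea is the same as in the pgstat patch: remember already-seen child
relations in a local hash table instead of re-scanning rels_list with
forboth() for every child. Roughly like this (a sketch of the approach;
see the attached patch for the real code):

```
ListCell   *lc;

/* for each direct child found at the current level: */
foreach(lc, currentchildren)
{
    Oid             child_oid = lfirst_oid(lc);
    bool            found;
    SeenRelsEntry  *hash_entry;

    hash_entry = hash_search(seen_rels, &child_oid, HASH_ENTER, &found);
    if (found)
    {
        /* already seen: just bump its number-of-parents counter */
        lfirst_int(hash_entry->numparents_cell)++;
    }
    else
    {
        /* new child: remember it, expecting one parent initially */
        rels_list = lappend_oid(rels_list, child_oid);
        rel_numparents = lappend_int(rel_numparents, 1);
        hash_entry->numparents_cell = list_tail(rel_numparents);
    }
}
```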

On Wed, Mar 22, 2017 at 11:53:45AM +0000, Anastasia Lubennikova wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:       tested, passed
> Spec compliant:           tested, passed
> Documentation:            not tested
>
> The patch looks good to me.
> It applies cleanly, passes all the tests, and eliminates the bottleneck described in the first message.
> And as I can see from the thread, there were no objections from others,
> except a few minor comments about code style, which are fixed in the last version of the patch.
> So I mark it "Ready for committer".
>
> The new status of this patch is: Ready for Committer

--
Best regards,
Aleksander Alekseev

Attachment

Re: Re: Declarative partitioning optimization for large amount of partitions

From: Aleksander Alekseev
Also I would like to share some benchmark results.

Before applying any patches (copied from one of the previous messages):

```
latency average = 1153.495 ms
latency stddev = 154.366 ms
tps = 6.061104 (including connections establishing)
tps = 6.061211 (excluding connections establishing)
```

After applying the first patch (copied as well):

```
latency average = 853.862 ms
latency stddev = 112.270 ms
tps = 8.191861 (including connections establishing)
tps = 8.192028 (excluding connections establishing)
```

After applying both patches:

```
latency average = 533.646 ms
latency stddev = 86.311 ms
tps = 13.110328 (including connections establishing)
tps = 13.110608 (excluding connections establishing)
```

Small number of partitions (2 to be exact), no patches:

```
latency average = 0.928 ms
latency stddev = 0.296 ms
tps = 7535.224897 (including connections establishing)
tps = 7536.145457 (excluding connections establishing)
```

Small number of partitions, both patches applied:

```
latency average = 0.915 ms
latency stddev = 0.269 ms
tps = 7638.344922 (including connections establishing)
tps = 7639.164769 (excluding connections establishing)
```

As you can see, these patches don't make things worse in any regard.

Perf shows that both bottlenecks are gone. Before [1] and after [2].

[1] http://afiskon.ru/s/00/2008c4ae66_temp.png
[2] http://afiskon.ru/s/a5/fd81628a3a_temp.png

On Fri, Mar 24, 2017 at 03:17:44PM +0300, Aleksander Alekseev wrote:
> Hi Anastasia,
>
> Thanks a lot for the review!
>
> As was mentioned above there is also a bottleneck in the find_all_inheritors
> procedure. It turned out the problem there is similar and could easily be
> fixed as well. The corresponding patch is attached to this message. To keep
> things in order I'm attaching the previous patch as well.

--
Best regards,
Aleksander Alekseev

Re: Declarative partitioning optimization for large amount of partitions

From: Simon Riggs
On 1 March 2017 at 01:36, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

> I don't know which way you're thinking of fixing this, but a planner patch
> to implement faster partition-pruning would take care of this, I
> think.  As you may know, even declarative partitioned tables currently
> depend on constraint exclusion for partition-pruning, and the planner's
> current approach to handling inheritance requires opening all the child
> tables (partitions), whereas the new approach hopefully shouldn't need to
> do that.  I am not sure whether looking for a more localized fix for this
> would be worthwhile, although I may be wrong.

What "new approach" are we discussing?

Is there a patch or design discussion?

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Declarative partitioning optimization for large amount of partitions

From: Aleksander Alekseev
Hi Simon,

> > I don't know which way you're thinking of fixing this, but a planner patch
> > to implement faster partition-pruning would take care of this, I
> > think.  As you may know, even declarative partitioned tables currently
> > depend on constraint exclusion for partition-pruning, and the planner's
> > current approach to handling inheritance requires opening all the child
> > tables (partitions), whereas the new approach hopefully shouldn't need to
> > do that.  I am not sure whether looking for a more localized fix for this
> > would be worthwhile, although I may be wrong.
>
> What "new approach" are we discussing?
> Is there a patch or design discussion?

I think what was meant were the plans of my colleague Dmitry Ivanov to
implement partition-pruning. I've just spoken with Dmitry about this
matter. Unless there is anyone else who is already working on this
optimization, we would like to work on it together. Unfortunately there
is no patch or design discussion of partition-pruning in this
commitfest.

--
Best regards,
Aleksander Alekseev

Re: Re: Declarative partitioning optimization for large amount of partitions

From: Teodor Sigaev
> things in order I'm attaching the previous patch as well.

Patches look good, but I have some notes:

1) step 1: Why do you need TabStatHashEntry at all? TabStatHashEntry.t_id is
never read, so the hash entry could be just a pointer to PgStat_TableStatus.

2) step 1: In pgstat_report_stat() you remove entries from the hash one by
one until they are all gone. Isn't it better to hash_destroy/hash_create, or
even to let the hash live in a separate memory context and just reset it?
(Sketched below.)

3) step 1: Again in pgstat_report_stat(), all-zero entries aren't deleted
from the hash although they will be free from the point of view of
pgStatTabList.

4) step 2: The same as 1) about SeenRelsEntry->rel_id, but it isn't even
initialized anywhere.
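
To illustrate 2), I mean something like this (a sketch; pgStatTabHashContext
is a made-up name):

```
/* instead of removing entries one by one, drop the whole lookup table ... */
hash_destroy(pgStatTabHash);
pgStatTabHash = NULL;       /* recreated lazily by get_tabstat_entry() */

/* ... or, if the hash lives in its own memory context, just reset it */
MemoryContextReset(pgStatTabHashContext);
pgStatTabHash = NULL;
```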



-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 



Re: Re: Declarative partitioning optimization for large amount of partitions

From: Teodor Sigaev
Sorry, 1) and 4) are my fault; there's a comment in hsearch.h:
* ... The hash key
* is expected to be at the start of the caller's hash entry data structure.

Oops, I forgot that.
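
So the key both has to be the first field and is filled in by hash_search()
itself, which also answers 4). E.g. for the second patch's entry (a sketch):

```
/*
 * dynahash copies keysize bytes into the start of a new entry on
 * HASH_ENTER, so rel_id needs no explicit initialization and must
 * be the first field of the struct.
 */
typedef struct SeenRelsEntry
{
    Oid         rel_id;           /* hash key -- must come first */
    ListCell   *numparents_cell;  /* payload */
} SeenRelsEntry;
```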

Teodor Sigaev wrote:
>> things in order I'm attaching the previous patch as well.
>
> Patches look good, but I have some notes:
>
> 1) step 1: Why do you need TabStatHashEntry at all? TabStatHashEntry.t_id is
> never read, so the hash entry could be just a pointer to PgStat_TableStatus.
>
> 2) step 1: In pgstat_report_stat() you remove entries from the hash one by
> one until they are all gone. Isn't it better to hash_destroy/hash_create, or
> even to let the hash live in a separate memory context and just reset it?
>
> 3) step 1: Again in pgstat_report_stat(), all-zero entries aren't deleted
> from the hash although they will be free from the point of view of
> pgStatTabList.
>
> 4) step 2: The same as 1) about SeenRelsEntry->rel_id, but it isn't even
> initialized anywhere.
>
>
>

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 



Re: Re: Declarative partitioning optimization for large amount of partitions

From: Aleksander Alekseev
Hi Teodor,

Thanks a lot for the review!

> > step 1: In pgstat_report_stat() you remove entries from the hash one by
> > one until they are all gone. Isn't it better to hash_destroy/hash_create,
> > or even to let the hash live in a separate memory context and just reset it?

Agreed, fixed.

> > step 1: Again in pgstat_report_stat(), all-zero entries aren't deleted from
> > the hash although they will be free from the point of view of pgStatTabList.

Good point! Fixed.

--
Best regards,
Aleksander Alekseev

Attachment

Re: Declarative partitioning optimization for large amount of partitions

From: Amit Langote
On Fri, Mar 24, 2017 at 11:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 1 March 2017 at 01:36, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
>> I don't know which way you're thinking of fixing this, but a planner patch
>> to implement faster partition-pruning would take care of this, I
>> think.  As you may know, even declarative partitioned tables currently
>> depend on constraint exclusion for partition-pruning, and the planner's
>> current approach to handling inheritance requires opening all the child
>> tables (partitions), whereas the new approach hopefully shouldn't need to
>> do that.  I am not sure whether looking for a more localized fix for this
>> would be worthwhile, although I may be wrong.
>
> What "new approach" are we discussing?
>
> Is there a patch or design discussion?

Neither at the moment.  As Aleksander said in his reply I was
referring to Dmitry Ivanov's plan of porting pg_pathman's planner
functionality to core that he mentioned on the declarative
partitioning thread back in December [1].

Thanks,
Amit

[1] https://www.postgresql.org/message-id/426b2b01-61e0-43aa-bd84-c6fcf516f1c3%40postgrespro.ru