Thread: Slow planning time for simple query

Slow planning time for simple query

From
Jeremy Finzel
Date:
Hello - 

We have an odd scenario on one of our OLTP systems, which behaves the same way on a streamer, of a 700-1000ms planning time for a query like this:

SELECT *
FROM table1
WHERE  source_id IN (SELECT id FROM table2 WHERE customer_id = $1);

The actual execution time is sub-ms.

We initially thought maybe catalog bloat?  But we were able to reindex all of the highly churned catalog tables, and I even did VACUUM FULL on pg_attribute and pg_statistic, to no avail.

There are no custom settings for pg_attribute for the given tables either.

Interestingly, the problem goes away on a SAN snapshot of the target system.

Any ideas of what else we could try?  A PL function that caches the query plan works, but that is just a workaround.
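A minimal sketch of that workaround, with hypothetical names (plpgsql caches plans for static SQL across calls within a session):

CREATE OR REPLACE FUNCTION table1_for_customer(p_customer_id int)
RETURNS SETOF table1
LANGUAGE plpgsql STABLE AS $$
BEGIN
    -- the plan for this static statement is built on an early call and
    -- then reused, sidestepping the repeated planning cost
    RETURN QUERY
        SELECT *
        FROM table1
        WHERE source_id IN (SELECT id FROM table2
                            WHERE customer_id = p_customer_id);
END;
$$;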

Thanks!
Jeremy

Re: Slow planning time for simple query

From
Tom Lane
Date:
Jeremy Finzel <finzelj@gmail.com> writes:
> We have an odd scenario on one of our OLTP systems, which behaves the same
> way on a streamer, of a 700-1000ms planning time for a query like this:

> SELECT *
> FROM table1
> WHERE  source_id IN (SELECT id FROM table2 WHERE customer_id = $1);

Hm.  Is this the first query executed in a session?  If so maybe it's
got something to do with populating caches and other session spin-up
overhead.

Another theory is it's some sort of locking issue.  Turning on
log_lock_waits, and setting deadlock_timeout to say 100ms, would help
in investigating that.
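For example, one quick way to try that for just the current session (a sketch; both settings require superuser, and the values are only illustrative):

SET log_lock_waits = on;
SET deadlock_timeout = '100ms';
-- then re-run the slow query and check the server log for lock-wait messages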

            regards, tom lane


Re: Slow planning time for simple query

From
Jeremy Finzel
Date:


On Wed, Jun 6, 2018 at 12:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeremy Finzel <finzelj@gmail.com> writes:
> We have an odd scenario on one of our OLTP systems, which behaves the same
> way on a streamer, of a 700-1000ms planning time for a query like this:

> SELECT *
> FROM table1
> WHERE  source_id IN (SELECT id FROM table2 WHERE customer_id = $1);

Hm.  Is this the first query executed in a session?  If so maybe it's
got something to do with populating caches and other session spin-up
overhead.

Another theory is it's some sort of locking issue.  Turning on
log_lock_waits, and setting deadlock_timeout to say 100ms, would help
in investigating that.

                        regards, tom lane

I have run it over and over with no improvement in the planning time, so I don't think it's related to being the first query in a session.  So far I can only make it faster with a PL function.

We have log_lock_waits on and nothing shows up, and turning down deadlock_timeout doesn't do anything either.

Thanks,
Jeremy

Re: Slow planning time for simple query

From
Jeremy Finzel
Date:


On Wed, Jun 6, 2018 at 1:13 PM, Jeremy Finzel <finzelj@gmail.com> wrote:


On Wed, Jun 6, 2018 at 12:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeremy Finzel <finzelj@gmail.com> writes:
> We have an odd scenario on one of our OLTP systems, which behaves the same
> way on a streamer, of a 700-1000ms planning time for a query like this:

> SELECT *
> FROM table1
> WHERE  source_id IN (SELECT id FROM table2 WHERE customer_id = $1);

Hm.  Is this the first query executed in a session?  If so maybe it's
got something to do with populating caches and other session spin-up
overhead.

Another theory is it's some sort of locking issue.  Turning on
log_lock_waits, and setting deadlock_timeout to say 100ms, would help
in investigating that.

                        regards, tom lane

I have run it over and over with no improvement in the planning time, so I don't think it's related to being the first query in a session.  So far I can only make it faster with a PL function.

We have log_lock_waits on and nothing shows up, and turning down deadlock_timeout doesn't do anything either.

Thanks,
Jeremy

Forgot to mention, this is running on this version:
PostgreSQL 9.6.8 on x86_64-pc-linux-gnu (Ubuntu 9.6.8-1.pgdg16.04+1), compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, 64-bit


Thanks,
Jeremy

Re: Slow planning time for simple query

From
Pavel Stehule
Date:


2018-06-06 18:59 GMT+02:00 Jeremy Finzel <finzelj@gmail.com>:
Hello - 

We have an odd scenario on one of our OLTP systems, which behaves the same way on a streamer, of a 700-1000ms planning time for a query like this:

SELECT *
FROM table1
WHERE  source_id IN (SELECT id FROM table2 WHERE customer_id = $1);

The actual execution time is sub-ms.

We initially thought maybe catalog bloat?  But we were able to reindex all of the highly churned catalog tables, and I even did VACUUM FULL on pg_attribute and pg_statistic, to no avail.

There are no custom settings for pg_attribute for the given tables either.

Interestingly, the problem goes away on a SAN snapshot of the target system.

Any ideas of what else we could try?  A PL function that caches the query plan works, but that is just a workaround.

More than once I have seen a similar issue caused by bloated indexes on a table - at planning time pg detects the min and max from the available indexes
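For example, a rough bloat check with the pgstattuple extension (a sketch; the index name here is hypothetical):

CREATE EXTENSION IF NOT EXISTS pgstattuple;
-- low avg_leaf_density and high leaf_fragmentation suggest a bloated index
SELECT avg_leaf_density, leaf_fragmentation
FROM pgstatindex('table2_customer_id_idx');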

Regards

Pavel


Thanks!
Jeremy

Re: Slow planning time for simple query

From
Tom Lane
Date:
Pavel Stehule <pavel.stehule@gmail.com> writes:
> 2018-06-06 18:59 GMT+02:00 Jeremy Finzel <finzelj@gmail.com>:
>> We have an odd scenario on one of our OLTP systems, which behaves the same
>> way on a streamer, of a 700-1000ms planning time for a query like this:
>> SELECT *
>> FROM table1
>> WHERE  source_id IN (SELECT id FROM table2 WHERE customer_id = $1);

> More than once I have seen a similar issue caused by bloated indexes on a
> table - at planning time pg detects the min and max from the available
> indexes

Oh, hmm, yeah it could be ye olde get_actual_variable_range() issue.
When this happens, are there perhaps a lot of recently-dead rows at either
extreme of the range of table1.source_id or table2.id?
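That probe can be imitated by hand, since get_actual_variable_range essentially runs an index scan to the end of the range (a sketch using this thread's placeholder names):

EXPLAIN (ANALYZE, BUFFERS)
SELECT max(source_id) FROM table1;
-- a long runtime and a large buffer count here, for a single-row answer,
-- would point at many dead index entries at the top of the range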

We made a fix last year to improve that:
https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=3ca930fc3
but it wasn't back-patched.

            regards, tom lane


Re: Slow planning time for simple query

From
Adrian Klaver
Date:
On 06/06/2018 09:59 AM, Jeremy Finzel wrote:
> Hello -
> 
> We have an odd scenario on one of our OLTP systems, which behaves the 
> same way on a streamer, of a 700-1000ms planning time for a query like this:
> 
> SELECT *
> FROM table1
> WHERE  source_id IN (SELECT id FROM table2 WHERE customer_id = $1);
> 
> The actual execution time is sub-ms.

I am trying to sort out the various references to times, so could you:

Show the EXPLAIN ANALYZE output?

and also:

psql> \timing

psql> SELECT *
FROM table1
WHERE  source_id IN (SELECT id FROM table2 WHERE customer_id = $1);


> 
> We initially thought maybe catalog bloat?  But we were able to reindex 
> all of the highly churned catalog tables, and I even did VACUUM FULL on 
> pg_attribute and pg_statistic, to no avail.
> 
> There are no custom settings for pg_attribute for the given tables either.
> 
> Interestingly, the problem goes away on a SAN snapshot of the target system.
> 
> Any ideas of what else we could try?  A PL function that caches the 
> query plan works, but that is just a workaround.
> 
> Thanks!
> Jeremy


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: Slow planning time for simple query

From
Jerry Sievers
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Pavel Stehule <pavel.stehule@gmail.com> writes:
>
>> 2018-06-06 18:59 GMT+02:00 Jeremy Finzel <finzelj@gmail.com>:
>>> We have an odd scenario on one of our OLTP systems, which behaves the same
>>> way on a streamer, of a 700-1000ms planning time for a query like this:
>>> SELECT *
>>> FROM table1
>>> WHERE  source_id IN (SELECT id FROM table2 WHERE customer_id = $1);
>
>> More than once I have seen a similar issue caused by bloated indexes on a
>> table - at planning time pg detects the min and max from the available
>> indexes
>
> Oh, hmm, yeah it could be ye olde get_actual_variable_range() issue.
> When this happens, are there perhaps a lot of recently-dead rows at either
> extreme of the range of table1.source_id or table2.id?

We noticed the cluster of interest had a rogue physical rep slot holding
71k WAL segments.

Dropping that slot seemed to correlate with the problem going away.
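
For reference, the slots and what they pin can be inspected, and a confirmed-stale slot dropped, roughly like this (the slot name is hypothetical):

SELECT slot_name, slot_type, active, xmin, restart_lsn
FROM pg_replication_slots;

-- only once the slot is known to be abandoned:
SELECT pg_drop_replication_slot('rogue_slot');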

Does that sound like a plausible explanation for the observed slow
planning times?

Thx



> We made a fix last year to improve that:
> https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=3ca930fc3
> but it wasn't back-patched.
>
>             regards, tom lane
>
>

-- 
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800


Re: Slow planning time for simple query

From
Tom Lane
Date:
Jerry Sievers <gsievers19@comcast.net> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> Oh, hmm, yeah it could be ye olde get_actual_variable_range() issue.
>> When this happens, are there perhaps a lot of recently-dead rows at either
>> extreme of the range of table1.source_id or table2.id?

> We noticed the cluster of interest had a rogue physical rep slot holding
> 71k WAL segments.
> Dropping that slot seemed to correlate with the problem going away.
> Does that sound like a plausible explanation for the observed slow
> planning times?

I believe the slot would hold back global xmin and thereby prevent
"recently-dead" rows from becoming just plain "dead", so yeah, this
observation does seem to square with the get_actual_variable_range
theory.  You'd still need to posit that something had recently deleted
a lot of rows at the end of the range of one of those columns, though.
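
A rough way to see what is holding the global xmin horizon back (a sketch; age() gives the number of transactions between the given xid and now):

SELECT pid, state, age(backend_xmin) AS xmin_age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;

SELECT slot_name, age(xmin) AS slot_xmin_age
FROM pg_replication_slots
WHERE xmin IS NOT NULL;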

            regards, tom lane


Re: Slow planning time for simple query

From
Maksim Milyutin
Date:
On hot standby I faced a similar problem.


The following simple query:

SELECT array_to_json(array_agg(t)) from (
     select *
         from     main.message m
             join main.message_instance mi on m.message_id = mi.message_id
             join main.channel_type ct on mi.channel_type_id = ct.channel_type_id
         where   m.user_id = 2152595
             and ct.name = 'android'
             and m.user_delete_ts is null
         order by
             case
                 when read_ts is null then 0
                 else 1
             end,
             m.message_id desc
         limit 100
     ) t;

is planned in 4.940 ms on master and *254.741* ms on standby. I tried to 
investigate the reasons for such long planning on the replica and determined 
that index accesses at the planning stage turn into multiple heap scans.


Here are the execution plan statistics for the query above:

  Aggregate (actual time=0.641..0.642 rows=1 loops=1)
    Buffers: shared hit=14
    ->  Subquery Scan on t (actual time=0.337..0.360 rows=2 loops=1)
          Buffers: shared hit=14
          ->  Limit (actual time=0.292..0.301 rows=2 loops=1)
                Buffers: shared hit=14
                ->  Sort (actual time=0.287..0.291 rows=2 loops=1)
                      Sort Key: (CASE WHEN (m.read_ts IS NULL) THEN 0 ELSE 1 END), m.message_id DESC
                      Sort Method: quicksort  Memory: 27kB
                      Buffers: shared hit=14
                      ->  Nested Loop (actual time=0.157..0.219 rows=2 loops=1)
                            Buffers: shared hit=14
                            ->  Seq Scan on channel_type ct (actual time=0.043..0.048 rows=1 loops=1)
                                  Filter: (name = 'android'::text)
                                  Rows Removed by Filter: 7
                                  Buffers: shared hit=1
                            ->  Nested Loop (actual time=0.098..0.148 rows=2 loops=1)
                                  Buffers: shared hit=13
                                  ->  Index Scan using message_user_id_idx1 on message m (actual time=0.055..0.063 rows=2 loops=1)
                                        Index Cond: (user_id = 2152595)
                                        Buffers: shared hit=5
                                  ->  Index Scan using message_instance_pkey on message_instance mi (actual time=0.026..0.029 rows=1 loops=2)
                                        Index Cond: ((message_id = m.message_id) AND (channel_type_id = ct.channel_type_id))
                                        Buffers: shared hit=8

So there are only 14 buffer cache accesses during query execution. But the 
pg_statio_user_tables and pg_statio_user_indexes views show a different 
picture:

select
     schemaname, relname, indexrelname, idx_blks_hit, idx_blks_read,
pg_size_pretty(pg_relation_size(indexrelid::regclass))
from pg_statio_user_indexes
where idx_blks_hit + idx_blks_read > 0;

  schemaname |     relname      |     indexrelname      | idx_blks_hit | idx_blks_read | pg_size_pretty
 ------------+------------------+-----------------------+--------------+---------------+----------------
  main       | channel_type     | channel_type_pkey     |            2 |             0 | 16 kB
  main       | message_instance | message_instance_pkey |          666 |             0 | 345 MB
  main       | message          | message_pkey          |           56 |             0 | 53 MB
  main       | message          | message_user_id_idx1  |            3 |             0 | 17 MB

select
     schemaname, relname, heap_blks_read, heap_blks_hit
from pg_statio_user_tables
where heap_blks_read + heap_blks_hit + idx_blks_hit + idx_blks_read > 0;

  schemaname |     relname      | heap_blks_read | heap_blks_hit
------------+------------------+----------------+---------------
  main       | channel_type     |              0 |             3
  main       | message          |              0 |          8682
  main       | message_instance |              0 | 114922

(This experiment was carried out on a test hot standby with no concurrent 
activity and with cleared system statistics.)


Vacuuming the problem tables (main.message and main.message_instance) on 
the master node resolves the problem somehow, but running vacuum that often 
slows down all queries and generally increases IO.
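
The narrowest form of that workaround would be a manual pass over just the two problem tables rather than a database-wide vacuum (a sketch):

VACUUM main.message;
VACUUM main.message_instance;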


Is there any way to overcome the problem, or is this a fundamental issue 
that makes it necessary to rewrite the query to simplify planning?


-- 
Regards,
Maksim Milyutin



Re: Slow planning time for simple query

From
Maksim Milyutin
Date:
On 09.06.2018 21:49, Maksim Milyutin wrote:

> On hot standby I faced a similar problem.

Sorry, the problem in question is described here 
https://www.postgresql.org/message-id/22136.1528312205%40sss.pgh.pa.us

-- 
Regards,
Maksim Milyutin



Re: Slow planning time for simple query

From
Tom Lane
Date:
Maksim Milyutin <milyutinma@gmail.com> writes:
> On hot standby I faced a similar problem.
> ...
> is planned in 4.940 ms on master and *254.741* ms on standby.

Presumably the problem is that the standby isn't authorized to change
the btree index's "entry is dead" bits, so it's unable to prune index
entries previously detected as dead, and thus the logic that intends
to improve this situation doesn't work on the standby.

(I wonder though why, if you executed the same query on the master,
its setting of the index-entry-is-dead bits didn't propagate to the
standby.)

I wonder if we should extend the "SnapshotNonVacuumable" logic introduced
in commit 3ca930fc3 so that in hot standby, *all* index entries are deemed
non vacuumable.  This would essentially get rid of long standby planning
times in this sort of scenario by instead accepting worse (possibly much
worse) planner range estimates.  I'm unsure if that's a good tradeoff or
not.

            regards, tom lane


Re: Slow planning time for simple query

From
Maksim Milyutin
Date:

On 09.06.2018 22:49, Tom Lane wrote:

Maksim Milyutin <milyutinma@gmail.com> writes:
On hot standby I faced a similar problem.
...
is planned in 4.940 ms on master and *254.741* ms on standby.

(I wonder though why, if you executed the same query on the master,
its setting of the index-entry-is-dead bits didn't propagate to the
standby.)

I have verified the number of dead item pointers (through the pageinspect extension) in the first leaf page of the index participating in the query ('main.message_instance_pkey') on the master and slave nodes and have noticed a big difference.

SELECT * FROM monitoring.bt_page_stats('main.message_instance_pkey', 3705);

On master:

 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
  3705 | l    |          1 |         58 |            24 |      8192 |      6496 |         0 |      3719 |    0 |         65

On standby:

 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
  3705 | l    |         59 |          0 |            24 |      8192 |      6496 |         0 |      3719 |    0 |          1


The vacuum routine improves the situation.
Can there be something that I have incorrectly configured in WAL logging or replication?


I wonder if we should extend the "SnapshotNonVacuumable" logic introduced
in commit 3ca930fc3 so that in hot standby, *all* index entries are deemed
non vacuumable.  This would essentially get rid of long standby planning
times in this sort of scenario by instead accepting worse (possibly much
worse) planner range estimates.  I'm unsure if that's a good tradeoff or
not.

I applied the patch introduced in this commit to the test standby (not the master; I don't know if this is correct) and haven't noticed any difference.

-- 
Regards,
Maksim Milyutin

Re: Slow planning time for simple query

From
Maksim Milyutin
Date:

13.06.2018 12:40, Maksim Milyutin wrote:

On 09.06.2018 22:49, Tom Lane wrote:

Maksim Milyutin <milyutinma@gmail.com> writes:
On hot standby I faced a similar problem.
...
is planned in 4.940 ms on master and *254.741* ms on standby.
(I wonder though why, if you executed the same query on the master,
its setting of the index-entry-is-dead bits didn't propagate to the
standby.)

I have verified the number of dead item pointers (through the pageinspect extension) in the first leaf page of the index participating in the query ('main.message_instance_pkey') on the master and slave nodes and have noticed a big difference.

SELECT * FROM monitoring.bt_page_stats('main.message_instance_pkey', 3705);

On master:

 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
  3705 | l    |          1 |         58 |            24 |      8192 |      6496 |         0 |      3719 |    0 |         65

On standby:

 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
  3705 | l    |         59 |          0 |            24 |      8192 |      6496 |         0 |      3719 |    0 |          1



At this point I want to highlight the issue that the changes to the lp_flags bits of index item pointers (namely, marking items as dead) don't propagate from master to replica in my case. As a consequence, on standby I have live index items, most of which are marked as dead on master. And at the planning stage my queries are forced to descend to heap pages during get_actual_variable_range execution, which considerably slows down planning.

Is it a bug, a restriction of the implementation, or a misconfiguration of WAL/replication?

-- 
Regards,
Maksim Milyutin

Re: Slow planning time for simple query

From
Amit Kapila
Date:
On Sun, Jun 10, 2018 at 1:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Maksim Milyutin <milyutinma@gmail.com> writes:
>> On hot standby I faced a similar problem.
>> ...
>> is planned in 4.940 ms on master and *254.741* ms on standby.
>
> Presumably the problem is that the standby isn't authorized to change
> the btree index's "entry is dead" bits,
>

I don't see anything like that in the code.  We use _bt_killitems to
mark the items as dead, and neither that function nor any of its callers
has any such assumption.

> so it's unable to prune index
> entries previously detected as dead, and thus the logic that intends
> to improve this situation doesn't work on the standby.
>

If my above understanding is correct, then one thing that could lead
to such behavior is the computation of RecentGlobalXmin on standby.
Basically, if the RecentGlobalXmin has a different value on standby,
then it is possible that the decision whether a particular item is
dead differs on master and standby.

> (I wonder though why, if you executed the same query on the master,
> its setting of the index-entry-is-dead bits didn't propagate to the
> standby.)
>

Because we don't WAL log it.  See _bt_killitems.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Slow planning time for simple query

From
Amit Kapila
Date:
On Thu, Jun 14, 2018 at 4:34 AM, Maksim Milyutin <milyutinma@gmail.com> wrote:
> 13.06.2018 12:40, Maksim Milyutin wrote:
>
> On 09.06.2018 22:49, Tom Lane wrote:
>
> Maksim Milyutin <milyutinma@gmail.com> writes:
>
> On hot standby I faced a similar problem.
> ...
> is planned in 4.940 ms on master and *254.741* ms on standby.
>
> (I wonder though why, if you executed the same query on the master,
> its setting of the index-entry-is-dead bits didn't propagate to the
> standby.)
>
>
> I have verified the number of dead item pointers (through the pageinspect
> extension) in the first leaf page of the index participating in the query
> ('main.message_instance_pkey') on the master and slave nodes and have
> noticed a big difference.
>
> SELECT * FROM monitoring.bt_page_stats('main.message_instance_pkey', 3705);
>
> On master:
>
>  blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
> -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
>   3705 | l    |          1 |         58 |            24 |      8192 |      6496 |         0 |      3719 |    0 |         65
>
> On standby:
>
>  blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
> -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
>   3705 | l    |         59 |          0 |            24 |      8192 |      6496 |         0 |      3719 |    0 |          1
>
>
>
> At this point I want to highlight the issue that the changes to the lp_flags
> bits of index item pointers (namely, marking items as dead) don't propagate
> from master to replica in my case. As a consequence, on standby I have live
> index items, most of which are marked as dead on master. And at the planning
> stage my queries are forced to descend to heap pages during
> get_actual_variable_range execution, which considerably slows down planning.
>
> Is it a bug, a restriction of the implementation, or a misconfiguration of
> WAL/replication?
>

It is not a misconfiguration issue.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Slow planning time for simple query

From
Andrew Gierth
Date:
>>>>> "Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:

 >> Presumably the problem is that the standby isn't authorized to change
 >> the btree index's "entry is dead" bits,

 Amit> I don't see anything like that in the code. We use _bt_killitems
 Amit> to mark the items as dead and neither that function or any of its
 Amit> caller has any such assumption.

See index_fetch_heap:

    /*
     * If we scanned a whole HOT chain and found only dead tuples, tell index
     * AM to kill its entry for that TID (this will take effect in the next
     * amgettuple call, in index_getnext_tid).  We do not do this when in
     * recovery because it may violate MVCC to do so.  See comments in
     * RelationGetIndexScan().
     */
    if (!scan->xactStartedInRecovery)
        scan->kill_prior_tuple = all_dead;

(this is the only place where kill_prior_tuple can be set to true)

-- 
Andrew (irc:RhodiumToad)

