Thread: Re: Slow planning time for simple query

Re: Slow planning time for simple query

From
Tom Lane
Date:
[ trimming cc's ]

Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> See index_fetch_heap:
>     /*
>      * If we scanned a whole HOT chain and found only dead tuples, tell index
>      * AM to kill its entry for that TID (this will take effect in the next
>      * amgettuple call, in index_getnext_tid).  We do not do this when in
>      * recovery because it may violate MVCC to do so.  See comments in
>      * RelationGetIndexScan().
>      */
>     if (!scan->xactStartedInRecovery)
>         scan->kill_prior_tuple = all_dead;

So the referenced comment in RelationGetIndexScan says

     * During recovery we ignore killed tuples and don't bother to kill them
     * either. We do this because the xmin on the primary node could easily be
     * later than the xmin on the standby node, so that what the primary
     * thinks is killed is supposed to be visible on standby. So for correct
     * MVCC for queries during recovery we must ignore these hints and check
     * all tuples. Do *not* set ignore_killed_tuples to true when running in a
     * transaction that was started during recovery. xactStartedInRecovery
     * should not be altered by index AMs.

but it seems to me that this is not terribly carefully thought through.

1. If global xmin on the primary is later than on the standby, VACUUM could
remove tuples that should be visible on the standby, and that would shortly
propagate to the standby.  Simply ignoring index kill bits on the standby
won't prevent that.

2. Although _bt_killitems doesn't WAL-log its setting of kill bits, those
bits could propagate to the standby anyway, as a result of a subsequent
WAL action on the index page that gets a full-page image added.

I believe that in some replication modes, #1 isn't a problem because we
have mechanisms to hold back the primary's global xmin.  If that's true
always, then all of this logic is obsolete silliness that we could remove.
If it's not true always, we have bigger problems to fix.

Similarly, the possibility of #2 means the standby can't trust the kill
bits to reflect its own xmin anyway.  So if we aren't junking this entire
stack of logic, we could perhaps allow the standby to set kill bits per
its own understanding of global xmin, and then allow the kill bits to be
used in SnapshotNonVacuumable scans even though they're unsafe for regular
MVCC scans.

However, I'm still caffeine-short so maybe there's something I missed.

            regards, tom lane


Re: Slow planning time for simple query

From
Andrew Gierth
Date:
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

 Tom>      * During recovery we ignore killed tuples and don't bother to kill them
 Tom>      * either. We do this because the xmin on the primary node could easily be
 Tom>      * later than the xmin on the standby node, so that what the primary
 Tom>      * thinks is killed is supposed to be visible on standby. So for correct
 Tom>      * MVCC for queries during recovery we must ignore these hints and check
 Tom>      * all tuples. Do *not* set ignore_killed_tuples to true when running in a
 Tom>      * transaction that was started during recovery. xactStartedInRecovery
 Tom>      * should not be altered by index AMs.

 Tom> but it seems to me that this is not terribly carefully thought through.

 Tom> 1. If global xmin on the primary is later than on the standby,
 Tom> VACUUM could remove tuples that should be visible on the standby,
 Tom> and that would shortly propagate to the standby. Simply ignoring
 Tom> index kill bits on the standby won't prevent that.

Right, but we have conflict detection for vacuum's tuple removal
actions, which we don't have for the index hints.

 Tom> 2. Although _bt_killitems doesn't WAL-log its setting of kill
 Tom> bits, those bits could propagate to the standby anyway, as a
 Tom> result of a subsequent WAL action on the index page that gets a
 Tom> full-page image added.

That's OK as long as we're ignoring those hints on the standby.

 Tom> I believe that in some replication modes, #1 isn't a problem
 Tom> because we have mechanisms to hold back the primary's global xmin.

That's the case if feedback is enabled, but not if it's disabled, which
is sometimes done intentionally to ensure that long-running queries on
the standby don't hold back the master's xmin horizon.

Conflict detection then comes into play to kill the aforesaid
long-running queries before vacuuming away anything they might see.

-- 
Andrew (irc:RhodiumToad)


Re: Slow planning time for simple query

From
Amit Kapila
Date:
On Sun, Jun 17, 2018 at 9:22 PM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
>>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>
>  Tom> 2. Although _bt_killitems doesn't WAL-log its setting of kill
>  Tom> bits, those bits could propagate to the standby anyway, as a
>  Tom> result of a subsequent WAL action on the index page that gets a
>  Tom> full-page image added.
>
> That's OK as long as we're ignoring those hints on the standby.
>

What if we don't ignore those hints on standby for a specific case
like the one in get_actual_variable_range?  Now, if the user has
enabled wal_log_hints on the master, it could save time from scanning
many dead tuples on standby.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Slow planning time for simple query

From
Tom Lane
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:
> On Sun, Jun 17, 2018 at 9:22 PM, Andrew Gierth
> <andrew@tao11.riddles.org.uk> wrote:
>> That's OK as long as we're ignoring those hints on the standby.

> What if we don't ignore those hints on standby for a specific case
> like the one in get_actual_variable_range?

Yeah, that's the same idea I suggested upthread.  Andrew shot down
my first thought (correctly I think) but the second one still seems
feasible.

> Now, if the user has
> enabled wal_log_hints on the master, it could save time from scanning
> many dead tuples on standby.

We should have the standby set the hint bits for itself, rather than
relying on wal_log_hints.

            regards, tom lane


Re: Slow planning time for simple query

From
Amit Kapila
Date:
On Mon, Jun 18, 2018 at 7:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
>> On Sun, Jun 17, 2018 at 9:22 PM, Andrew Gierth
>> <andrew@tao11.riddles.org.uk> wrote:
>>> That's OK as long as we're ignoring those hints on the standby.
>
>> What if we don't ignore those hints on standby for a specific case
>> like the one in get_actual_variable_range?
>
> Yeah, that's the same idea I suggested upthread.  Andrew shot down
> my first thought (correctly I think) but the second one still seems
> feasible.
>
>> Now, if the user has
>> enabled wal_log_hints on the master, it could save time from scanning
>> many dead tuples on standby.
>
> We should have the standby set the hint bits for itself, rather than
> relying on wal_log_hints.
>

Yeah, I think that is a good idea unless someone sees a hole in it,
but we have to keep in mind that the bits set by standby can be
overwritten later for instance (a) due to an FPI for a change on that
page in master and (b)  when that page is evicted from shared buffers
because we don't allow to dirty the page during recovery.  However, I
think the same is true for regular hint bits (hint bit on a tuple that
suggests transaction is committed).  On the one hand it will be good
if we can do it as a single patch, but it is not difficult to imagine
that we can do this as two separate patches where the first patch can
just introduce an idea to not ignore kill bits on standby in a
particular case (get_actual_variable_range) and the second patch to
introduce something like StandbyGlobalXmin which will then be used in
get_actual_variable_range.  Both are independently useful.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com