Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions - Mailing list pgsql-hackers

From Konstantin Knizhnik
Subject Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date
Msg-id f5c07663-3846-cc07-3f79-620db4eeff24@postgrespro.ru
In response to Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
List pgsql-hackers

On 11.01.2018 22:41, Peter Eisentraut wrote:
> On 12/22/17 23:57, Tomas Vondra wrote:
>> PART 1: adding logical_work_mem memory limit (0001)
>> ---------------------------------------------------
>>
>> Currently, limiting the amount of memory consumed by logical decoding is
>> tricky (or you might say impossible) for several reasons:
> I would like to see some more discussion on this, but I think not a lot
> of people understand the details, so I'll try to write up an explanation
> here.  This code is also somewhat new to me, so please correct me if
> there are inaccuracies, while keeping in mind that I'm trying to simplify.
>
> The data in the WAL is written as it happens, so the changes belonging
> to different transactions are all mixed together.  One of the jobs of
> logical decoding is to reassemble the changes belonging to each
> transaction.  The top-level data structure for that is the infamous
> ReorderBuffer.  So as it reads the WAL and sees something about a
> transaction, it keeps a copy of that change in memory, indexed by
> transaction ID (ReorderBufferChange).  When the transaction commits, the
> accumulated changes are passed to the output plugin and then freed.  If
> the transaction aborts, then changes are just thrown away.
>
> So when logical decoding is active, a copy of the changes for each
> active transaction is kept in memory (once per walsender).
>
> More precisely, the above happens for each subtransaction.  When the
> top-level transaction commits, it finds all its subtransactions in the
> ReorderBuffer, reassembles everything in the right order, then invokes
> the output plugin.
>
> All this could end up using an unbounded amount of memory, so there is a
> mechanism to spill changes to disk.  The way this currently works is
> hardcoded, and this patch proposes to change that.
>
> Currently, when a transaction or subtransaction has accumulated 4096
> changes, it is spilled to disk.  When the top-level transaction commits,
> things are read back from disk to do the final processing mentioned above.
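
To make the current mechanism a bit more concrete, here is a rough,
self-contained sketch of the bookkeeping (all names are made up; the
real code lives in src/backend/replication/logical/reorderbuffer.c):

#include <stddef.h>
#include <stdint.h>

#define SPILL_THRESHOLD 4096        /* currently hardcoded */

typedef struct Change
{
    struct Change *next;
    /* decoded row data follows ... */
} Change;

typedef struct Txn
{
    uint32_t  xid;                  /* (sub)transaction this belongs to */
    size_t    nentries;             /* changes currently buffered in memory */
    Change   *changes;
} Txn;

void spill_to_disk(Txn *txn);       /* hypothetical: write buffered changes
                                     * to a temp file and free them */

/* called for every change decoded from the WAL */
void queue_change(Txn *txn, Change *change)
{
    change->next = txn->changes;
    txn->changes = change;
    if (++txn->nentries >= SPILL_THRESHOLD)
    {
        spill_to_disk(txn);         /* per-(sub)transaction limit */
        txn->nentries = 0;
        txn->changes  = NULL;
    }
}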
>
> This all works mostly fine, but you can construct some more extreme
> cases where this can blow up.
>
> Here is a mundane example.  Let's say a change entry takes 100 bytes (it
> might contain a new row, or an update key and some new column values,
> for example).  If you have 100 concurrent active sessions and no
> subtransactions, then logical decoding memory is bounded by 4096 * 100 *
> 100 = 40 MB (per walsender) before things spill to disk.
>
> Now let's say you are using a lot of subtransactions, because you are
> using PL functions, exception handling, triggers, doing batch updates.
> If you have 200 subtransactions on average per concurrent session, the
> memory usage bound in that case would be 4096 * 100 * 100 * 200 = 8 GB
> (per walsender).  And so on.  If you have more concurrent sessions or
> larger changes or more subtransactions, you'll use much more than those
> 8 GB.  And if you don't have those 8 GB, then you're stuck at this point.
>
> That is the consideration when we record changes, but we also need
> memory when we do the final processing at commit time.  That is slightly
> less problematic because we only process one top-level transaction at a
> time, so the formula is only 4096 * avg_size_of_changes * nr_subxacts
> (without the concurrent sessions factor).
>
> So, this patch proposes to improve this as follows:
>
> - We compute the actual size of each ReorderBufferChange and keep a
> running tally for each transaction, instead of just counting the number
> of changes.
>
> - We have a configuration setting that allows us to change the limit
> instead of the hardcoded 4096.  The configuration setting is also in
> terms of memory, not in number of changes.
>
> - The configuration setting is for the total memory usage per decoding
> session, not per subtransaction.  (So we also keep a running tally for
> the entire ReorderBuffer.)
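
In pseudo-code the proposed accounting is roughly the following
(illustrative only, not the actual patch code; logical_work_mem is
assumed here to be in kB, like work_mem):

/* on every queued change */
txn->size += change_size;           /* per-(sub)transaction tally */
rb->size  += change_size;           /* tally for the whole ReorderBuffer */

/* instead of "nentries >= 4096" per subtransaction,
 * check the global limit */
if (rb->size >= logical_work_mem * 1024L)
    evict_some_transaction(rb);     /* see the open issues below */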
>
> There are two open issues with this patch:
>
> One, this mechanism only applies when recording changes.  The processing
> at commit time still uses the previous hardcoded mechanism.  The reason
> for this is, AFAIU, that as things currently work, you have to have all
> subtransactions in memory to do the final processing.  There are some
> proposals to change this as well, but they are more involved.  Arguably,
> per my explanation above, memory use at commit time is less likely to be
> a problem.
>
> Two, what to do when the memory limit is reached.  With the old
> accounting, this was easy, because we'd decide for each subtransaction
> independently whether to spill it to disk, when it has reached its 4096
> limit.  Now, we are looking at a global limit, so we have to find a
> transaction to spill in some other way.  The proposed patch searches
> through the entire list of transactions to find the largest one.  But as
> the patch says:
>
> "XXX With many subtransactions this might be quite slow, because we'll
> have to walk through all of them. There are some options how we could
> improve that: (a) maintain some secondary structure with transactions
> sorted by amount of changes, (b) not looking for the entirely largest
> transaction, but e.g. for transaction using at least some fraction of
> the memory limit, and (c) evicting multiple transactions at once, e.g.
> to free a given portion of the memory limit (e.g. 50%)."
>
> (a) would create more overhead for the case where everything fits into
> memory, so it seems unattractive.  Some combination of (b) and (c) seems
> useful, but we'd have to come up with something concrete.
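
For instance, a (b)+(c)-style eviction loop could look roughly like
this (a sketch only, reusing the made-up names from above; the
thresholds are arbitrary):

/* (c): keep evicting until a sizable portion of the limit is free */
while (rb->size > logical_work_mem * 1024L / 2)
{
    /* (b): take any transaction above some fraction of the limit, so
     * the scan can stop at the first sufficiently large one instead
     * of always walking the whole list to find the maximum */
    Txn *victim = find_txn_larger_than(rb, logical_work_mem * 1024L / 10);

    if (victim == NULL)
        break;                      /* nothing individually large; fall
                                     * back to some other strategy */

    spill_to_disk(victim);          /* subtracts victim's size from rb->size */
}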
>
> Thoughts?
>

I am very sorry that I did not notice this thread before.
Spilling to disk in the reorder buffer is the main factor limiting the
speed of importing data in multimaster and shardman (sharding based on
FDW, with redundancy provided by LR).
This is why we have been thinking a lot about possible ways of
addressing this issue.
Right now the data of a huge transaction is written to disk three times
before it is applied at the replica, and obviously read three times as
well. First it is saved in the WAL, then spilled to disk by the reorder
buffer, and finally spilled to disk once more at the replica before
being assigned to a particular apply worker
(the last step is specific to multimaster, which can apply received
transactions concurrently).

We considered three different approaches:
1. Streaming. This is similar to the proposed patch; the main
difference is that we do not want to spill the transaction to a
temporary file at the replica, but rather apply it immediately in a
separate backend and abort that transaction if it is aborted at the
master. Certainly this will only work with 2PC.
2. Eliminating spilling by rescanning the WAL.
3. Bypassing the WAL: add hooks to heapam to buffer changes, propagate
them immediately to the replica, and apply them in a dedicated backend.
I have implemented a prototype of such replication. With one replica it
shows about a 1.5x slowdown compared with standalone/async LR and about
a 2-3x improvement compared with sync LR. For two replicas the result
is 2x slower than async LR and 2-8x faster than sync LR (depending on
the number of concurrent connections).

Approach 3 seems to be specific to multimaster/shardman, so most likely
it cannot be considered for general LR.
So I want to compare 1 and 2. Have you ever thought about something
like 2?

Right now the proposed patch just moves spilling to a file from the
master side to the replica side.
The patch can still make sense, to avoid memory overflow and reduce
disk I/O at the master.
But if we have just one huge transaction (COPY) importing gigabytes of
data into the database, then performance will be almost the same with
or without your patch. The only difference is where we serialize the
transaction: at the master or at the replica side.
In this sense the patch doesn't solve the problem of slow loading of
large bulks of data through LR.

Alternatively (approach 2), we can keep a small in-memory buffer for
the transaction being decoded and remember the LSN and snapshot of the
transaction start.
In case of buffer overflow we just continue the WAL traversal until we
reach the end of the transaction. After that we restart scanning the
WAL from the beginning of this transaction, and on this second pass
send the changes directly to the output plugin. So we may have to scan
the WAL several times, but we do not need to spill anything to disk,
neither at the publisher nor at the subscriber side.
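
In pseudo-code the idea is roughly the following (all names here are
invented, just to illustrate the mechanism):

/* first pass: on buffer overflow, drop the buffered changes but
 * remember where the transaction started */
txn->overflowed  = true;
txn->restart_lsn = txn->first_lsn;
txn->snapshot    = snapshot_at_first_lsn;

/* at commit of an overflowed transaction: second pass over the WAL */
for (lsn = txn->restart_lsn; lsn < commit_lsn; lsn = next_record(lsn))
{
    record = read_wal_record(lsn);            /* hypothetical reader */
    if (record_xid(record) != txn->xid)
        continue;                             /* other transactions: skip */
    send_change_to_output_plugin(txn, decode_change(record));
}
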
Certainly this approach will be inefficient if there are several long
interleaved transactions. But in most of the customer use cases we have
observed so far, there is just one huge transaction performing a bulk
load.
Maybe I have missed something, but this approach seems easier to
implement than transaction streaming, and it doesn't require any
changes to the output plugin API.
I realize that it is a little bit late to ask this question when your
patch is almost ready, but what do you think about it? Are there any
pitfalls with this approach?

There is one more aspect and performance problem with LR that we have
faced with shardman: if there are several publications for different
subsets of the tables at one instance, then the WAL senders have to do
a lot of useless work. They decode transactions which have no relation
to their publication, but a WAL sender doesn't know that until it
reaches the end of the transaction. What is worse: if the transaction
is huge, then all WAL senders will spill it to disk even though only
one of them actually needs it. So the data of a huge transaction is
written not three times, but N times, where N is the number of
publications. The only solution to this problem we can imagine is to
let the backend somehow inform the WAL senders (through a shared
message queue?) about the LSNs they should consider, so that a WAL
sender can skip large portions of the WAL without decoding them. We
would also like to know the opinion of 2ndQuadrant about this idea.
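
Very roughly, what we have in mind is something like this (all names
are invented):

/* backend side: after writing WAL for a table that belongs to
 * publication P, post (P, start_lsn, end_lsn) to a shared queue */
publication_note_lsn_range(pub_id, start_lsn, end_lsn);

/* walsender side: before decoding a WAL record */
if (!publication_covers_lsn(my_pub_id, record_lsn))
    continue;            /* skip decoding and buffering entirely */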

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


