Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date
Msg-id 20190829184824.kmrbchrk2ged6vjw@development
In response to Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions  (Alexey Kondratov <a.kondratov@postgrespro.ru>)
Responses Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions  (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
List pgsql-hackers
On Thu, Aug 29, 2019 at 05:37:45PM +0300, Alexey Kondratov wrote:
>On 28.08.2019 22:06, Tomas Vondra wrote:
>>
>>>
>>>>>>>Interesting. Any idea where the extra overhead in this
>>>>>>>particular case comes from? It's hard to deduce that from the
>>>>>>>single flame graph, when I don't have anything to compare it
>>>>>>>with (i.e. the flame graph for the "normal" case).
>>>>>>I guess the bottleneck is in disk operations. You can check the
>>>>>>logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
>>>>>>writes (~26%) take around 35% of CPU time in total. To compare,
>>>>>>please see the attached flame graph for the following transaction:
>>>>>>
>>>>>>INSERT INTO large_text
>>>>>>SELECT (SELECT string_agg('x', ',')
>>>>>>FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);
>>>>>>
>>>>>>Execution Time: 44519.816 ms
>>>>>>Time: 98333,642 ms (01:38,334)
>>>>>>
>>>>>>where disk I/O is only ~7-8% in total. So we get very roughly the same
>>>>>>~4-5x performance drop here. JFYI, I am using a machine with an SSD
>>>>>>for these tests.
>>>>>>
>>>>>>Therefore, you could probably write changes on the receiver in
>>>>>>bigger chunks, not each change separately.
>>>>>>
>>>>>Possibly. I/O is certainly a possible culprit, although we should be
>>>>>using buffered I/O and there certainly are no fsyncs here. So I'm
>>>>>not sure why it would be cheaper to do the writes in batches.
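To make the batching idea concrete, it would look something like the
standalone sketch below: serialized changes are appended to a local
buffer and flushed with one write() per batch instead of one write()
per change. This is only an illustration of the idea -- the names
(queue_change, flush_batch, BATCH_SIZE) are made up and none of this is
code from the patch.

/*
 * Illustrative only: batch many small changes into few write() calls.
 * Error handling and changes larger than BATCH_SIZE are ignored here.
 */
#include <string.h>
#include <unistd.h>

#define BATCH_SIZE (64 * 1024)

static char		batch[BATCH_SIZE];
static size_t	batch_used = 0;

/* Flush whatever has accumulated (normally a single write() per batch). */
static void
flush_batch(int fd)
{
	size_t		off = 0;

	while (off < batch_used)
	{
		ssize_t		n = write(fd, batch + off, batch_used - off);

		if (n < 0)
			break;				/* error handling omitted in this sketch */
		off += (size_t) n;
	}
	batch_used = 0;
}

/* Append one serialized change; hit the disk only when the buffer fills. */
static void
queue_change(int fd, const char *change, size_t len)
{
	if (batch_used + len > BATCH_SIZE)
		flush_batch(fd);
	memcpy(batch + batch_used, change, len);
	batch_used += len;
}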
>>>>>
>>>>>BTW does this mean you see the overhead on the apply side? Or are you
>>>>>running this on a single machine, and it's difficult to decide?
>>>>
>>>>I ran this on a single machine, but the walsender and the worker are
>>>>each utilizing almost 100% of a CPU all the time, and on the apply
>>>>side I/O syscalls take about 1/3 of the CPU time. I am still not
>>>>sure, but to me this result links the performance drop to problems
>>>>on the receiver side.
>>>>
>>>>Writing in batches was just a hypothesis, and to validate it I
>>>>performed a test with a large txn consisting of a smaller number of
>>>>wide rows. That test does not exhibit any significant performance
>>>>drop, even though it was streamed too. So the hypothesis seems to be
>>>>valid. Anyway, I do not have other reasonable ideas beyond that
>>>>right now.
>>>
>>>It seems that the overhead added by a synchronous replica is 2-3 times
>>>lower compared with Postgres master and streaming with spilling.
>>>Therefore, the original patch eliminated the delay before the sender
>>>starts processing a large transaction, while this additional patch
>>>speeds up the apply side.
>>>
>>>Although the overall speed-up is surely measurable, there is still
>>>room for improvement:
>>>
>>>1) Currently bgworkers are spawned only on demand, without any initial
>>>pool, and are never stopped. Maybe we should create a small pool on
>>>replication start and release some of the idle bgworkers if they
>>>exceed some limit?
>>>
>>>2) Probably we can somehow track whether an incoming change conflicts
>>>with one of the xacts currently being processed, so that we only have
>>>to wait for specific bgworkers in that case?
>>>
>>>3) Since the communication between the main logical apply worker and
>>>each bgworker from the pool is a 'single producer --- single consumer'
>>>problem, it is probably possible to wait and set/check flags without
>>>locks, using just atomics.
>>>
>>>What do you think about this concept in general? Any concerns and 
>>>criticism are welcome!
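On point (3), a lock-free single-producer/single-consumer handoff could
look roughly like the generic C11 sketch below. This is only an
illustration of the technique, not PostgreSQL or patch code -- in the
backend you would presumably build on the pg_atomic_* primitives and
shared memory instead, and the type and function names here are made up.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 1024			/* must be a power of two */

typedef struct SpscRing
{
	_Atomic size_t head;		/* next slot the producer will fill */
	_Atomic size_t tail;		/* next slot the consumer will read */
	void	   *slots[RING_SIZE];
} SpscRing;

/* Producer (main apply worker): returns false if the ring is full. */
static bool
ring_push(SpscRing *ring, void *change)
{
	size_t		head = atomic_load_explicit(&ring->head, memory_order_relaxed);
	size_t		tail = atomic_load_explicit(&ring->tail, memory_order_acquire);

	if (head - tail == RING_SIZE)
		return false;			/* full: caller decides whether to wait */
	ring->slots[head & (RING_SIZE - 1)] = change;
	atomic_store_explicit(&ring->head, head + 1, memory_order_release);
	return true;
}

/* Consumer (bgworker): returns NULL if the ring is empty. */
static void *
ring_pop(SpscRing *ring)
{
	size_t		tail = atomic_load_explicit(&ring->tail, memory_order_relaxed);
	size_t		head = atomic_load_explicit(&ring->head, memory_order_acquire);
	void	   *change;

	if (head == tail)
		return NULL;			/* empty */
	change = ring->slots[tail & (RING_SIZE - 1)];
	atomic_store_explicit(&ring->tail, tail + 1, memory_order_release);
	return change;
}

With exactly one producer and one consumer per queue this needs no locks
at all; the acquire/release pairs are enough to make the slot contents
visible to the other side.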
>>>
>>
>
>Hi Tomas,
>
>Thank you for a quick response.
>
>>I don't think it matters very much whether the workers are started at the
>>beginning or allocated ad hoc; that's IMO a minor implementation detail.
>
>OK, I had the same vision about this point. Any minor differences here
>will be negligible for a sufficiently large transaction.
>
>>
>>There's one huge challenge that I however don't see mentioned in your
>>message or in the patch (after a cursory reading) - ensuring the same
>>commit order, and not introducing deadlocks that would not exist in
>>single-process apply.
>
>Probably I haven't explained this part well, sorry for that. In my
>patch I don't use the worker pool for concurrent transaction apply, but
>rather for fast context switching between long-lived streamed
>transactions. In other words, we apply all changes arriving from the
>sender in a completely serial manner. Written out step by step, it
>looks like this:
>
>1) Read a STREAM START message and figure out the target worker by xid.
>
>2) Put all changes that belong to this xact into the selected worker
>one by one via shm_mq_send.
>
>3) Read the STREAM STOP message and wait until our worker has applied
>all changes in the queue.
>
>4) Process all other chunks of streamed xacts in the same manner.
>
>5) Process all non-streamed xacts immediately in the main apply worker loop.
>
>6) If we read a STREAMED COMMIT/ABORT, we again wait until the selected
>worker either commits or aborts.
>
>Thus, it automatically guarantees the same commit order on the replica
>as on the master. Yes, we lose some performance here, since we don't
>apply transactions concurrently, but doing so would bring all those
>problems you have described.
>

OK, so it's apply in multiple processes, but at any moment only a single
apply process is active. 
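
If I read the description right, the main loop is essentially the toy
model below: single-process, with the per-worker queues reduced to plain
counters, and every name made up. It is only meant to spell out where
the waits happen, not to reflect the actual patch code.

#include <stdio.h>

#define MAX_WORKERS 4

typedef struct StreamWorker
{
	unsigned int xid;			/* streamed xact this worker handles (0 = free) */
	int			pending;		/* changes sent but not yet applied */
} StreamWorker;

static StreamWorker workers[MAX_WORKERS];

/* Find the worker already handling xid, or grab a free slot for it. */
static StreamWorker *
worker_for_xid(unsigned int xid)
{
	for (int i = 0; i < MAX_WORKERS; i++)
		if (workers[i].xid == xid)
			return &workers[i];
	for (int i = 0; i < MAX_WORKERS; i++)
		if (workers[i].xid == 0)
		{
			workers[i].xid = xid;
			return &workers[i];
		}
	return NULL;				/* pool exhausted; not handled in this toy */
}

/*
 * Stand-in for "wait until the worker has drained its queue" (real code
 * would block on the shm_mq / a latch instead of applying inline).
 */
static void
wait_for_worker(StreamWorker *w)
{
	while (w->pending > 0)
		w->pending--;			/* pretend the bgworker applied one change */
}

int
main(void)
{
	/* STREAM START for xid 100: route its changes to one worker. */
	StreamWorker *w = worker_for_xid(100);

	for (int change = 0; change < 3; change++)
		w->pending++;			/* one shm_mq_send() per change in the patch */

	/*
	 * STREAM STOP: block until that worker has caught up, so only one
	 * apply process is ever active at a time.
	 */
	wait_for_worker(w);

	/* Non-streamed xacts are applied directly here, in the main loop. */

	/*
	 * STREAM COMMIT for xid 100: wait again, then free the worker.
	 * Because we always wait before moving on, the commit order on the
	 * replica matches the order in which commits arrive.
	 */
	wait_for_worker(w);
	w->xid = 0;

	printf("toy apply loop finished\n");
	return 0;
}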

>However, you helped me to figure out another point I had forgotten.
>Although we ensure the commit order automatically, the beginnings of
>streamed xacts may be reordered. This happens if some small xacts have
>been committed on the master since the streamed one started, because we
>do not start streaming immediately, but only after the logical_work_mem
>limit is hit. I have performed some tests with conflicting xacts and it
>seems that it's not a problem, since the locking mechanism in Postgres
>guarantees that if there were any deadlocks, they would have happened
>earlier on the master. So if some records hit the WAL, it is safe to
>apply them sequentially. Am I wrong?
>

I think you're right that the way you interleave the changes ensures you
can't introduce new deadlocks between transactions in this stream. I don't
think reordering the starts of streamed transactions matters, as long
as the commit order is ensured in this case.

>Anyway, I'm going to double check the safety of this part later.
>

OK.

FWIW my understanding is that the speedup comes mostly from eliminating
the serialization to a file. That however requires savepoints to handle
aborts of subtransactions - I'm pretty sure it'd be trivial to create a
workload where this will be much slower (with many aborts of large
subtransactions).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



