Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Msg-id: 20190829184824.kmrbchrk2ged6vjw@development
In response to: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions (Alexey Kondratov <a.kondratov@postgrespro.ru>)
Responses: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
List: pgsql-hackers
On Thu, Aug 29, 2019 at 05:37:45PM +0300, Alexey Kondratov wrote:
>On 28.08.2019 22:06, Tomas Vondra wrote:
>>
>>>
>>>>>>>Interesting. Any idea where does the extra overhead in this
>>>>>>>particular case come from? It's hard to deduce that from the single
>>>>>>>flame graph, when I don't have anything to compare it with (i.e. the
>>>>>>>flame graph for the "normal" case).
>>>>>>
>>>>>>I guess that bottleneck is in disk operations. You can check
>>>>>>logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
>>>>>>writes (~26%) take around 35% of CPU time in summary. To compare,
>>>>>>please, see attached flame graph for the following transaction:
>>>>>>
>>>>>>INSERT INTO large_text
>>>>>>SELECT (SELECT string_agg('x', ',')
>>>>>>FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);
>>>>>>
>>>>>>Execution Time: 44519.816 ms
>>>>>>Time: 98333,642 ms (01:38,334)
>>>>>>
>>>>>>where disk IO is only ~7-8% in total. So we get very roughly the same
>>>>>>~x4-5 performance drop here. JFYI, I am using a machine with SSD for
>>>>>>tests.
>>>>>>
>>>>>>Therefore, probably you may write changes on receiver in bigger
>>>>>>chunks, not each change separately.
>>>>>
>>>>>Possibly, I/O is certainly a possible culprit, although we should be
>>>>>using buffered I/O and there certainly are not any fsyncs here. So I'm
>>>>>not sure why would it be cheaper to do the writes in batches.
>>>>>
>>>>>BTW does this mean you see the overhead on the apply side? Or are you
>>>>>running this on a single machine, and it's difficult to decide?
>>>>
>>>>I run this on a single machine, but walsender and worker are utilizing
>>>>almost 100% of CPU per each process all the time, and at apply side I/O
>>>>syscalls take about 1/3 of CPU time. Though I am still not sure, but
>>>>for me this result somehow links performance drop with problems at
>>>>receiver side.
>>>>
>>>>Writing in batches was just a hypothesis and to validate it I have
>>>>performed test with large txn, but consisting of a smaller number of
>>>>wide rows. This test does not exhibit any significant performance drop,
>>>>while it was streamed too. So it seems to be valid. Anyway, I do not
>>>>have other reasonable ideas beside that right now.
>>>
>>>It seems that overhead added by synchronous replica is lower by 2-3
>>>times compared with Postgres master and streaming with spilling.
>>>Therefore, the original patch eliminated delay before large transaction
>>>processing start by sender, while this additional patch speeds up the
>>>applier side.
>>>
>>>Although the overall speed up is surely measurable, there is a room for
>>>improvements yet:
>>>
>>>1) Currently bgworkers are only spawned on demand without some initial
>>>pool and never stopped. Maybe we should create a small pool on
>>>replication start and offload some of idle bgworkers if they exceed
>>>some limit?
>>>
>>>2) Probably we can track somehow that incoming change has conflicts
>>>with some of being processed xacts, so we can wait for specific
>>>bgworkers only in that case?
>>>
>>>3) Since the communication between main logical apply worker and each
>>>bgworker from the pool is a 'single producer --- single consumer'
>>>problem, then probably it is possible to wait and set/check flags
>>>without locks, but using just atomics.
>>>
>>>What do you think about this concept in general? Any concerns and
>>>criticism are welcome!
>>>
>>
>
>Hi Tomas,
>
>Thank you for a quick response.
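
As a purely hypothetical illustration of point 3) in the quoted list above,
the single-producer/single-consumer hand-off between the main apply worker
and one bgworker could indeed be tracked with plain atomic counters instead
of locks. The ParallelApplyState struct and the pa_* functions below are
made-up names, not taken from any posted patch; only the pg_atomic_*
(port/atomics.h) and latch primitives are existing PostgreSQL APIs. A rough
sketch:

/*
 * Rough sketch only: lock-free progress counters shared between the main
 * apply worker (producer) and one bgworker (consumer).  Both counters must
 * be set up with pg_atomic_init_u64() when the shared state is created.
 */
#include "postgres.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "port/atomics.h"
#include "storage/latch.h"

typedef struct ParallelApplyState
{
    pg_atomic_uint64 nsent;     /* changes queued by the apply worker */
    pg_atomic_uint64 napplied;  /* changes applied by the bgworker */
} ParallelApplyState;

/* Producer side: called after queuing one change for the bgworker. */
static void
pa_note_change_sent(ParallelApplyState *state)
{
    pg_atomic_fetch_add_u64(&state->nsent, 1);
}

/* Consumer side: called by the bgworker after applying one change. */
static void
pa_note_change_applied(ParallelApplyState *state)
{
    pg_atomic_fetch_add_u64(&state->napplied, 1);
}

/*
 * Producer side: wait, without any lock, until the bgworker catches up,
 * e.g. at STREAM STOP or at commit time.
 */
static void
pa_wait_for_worker(ParallelApplyState *state)
{
    while (pg_atomic_read_u64(&state->napplied) <
           pg_atomic_read_u64(&state->nsent))
    {
        /* Sleep until the bgworker sets our latch, or a short timeout. */
        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         10L, PG_WAIT_EXTENSION);
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();
    }
}

The producer only ever increments nsent and the consumer only ever
increments napplied, so no lock is needed; the waiting side simply compares
the two counters and sleeps on its latch (the bgworker would SetLatch() the
apply worker after applying, with the timeout as a fallback).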
>
>>I don't think it matters very much whether the workers are started at
>>the beginning or allocated ad hoc, that's IMO a minor implementation
>>detail.
>
>OK, I had the same vision about this point. Any minor differences here
>will be neglectable for a sufficiently large transaction.
>
>>There's one huge challenge that I however don't see mentioned in your
>>message or in the patch (after cursory reading) - ensuring the same
>>commit order, and introducing deadlocks that would not exist in
>>single-process apply.
>
>Probably I haven't explained well this part, sorry for that. In my
>patch I don't use workers pool for a concurrent transaction apply, but
>rather for a fast context switch between long-lived streamed
>transactions. In other words we apply all changes arrived from the
>sender in a completely serial manner. Being written step-by-step it
>looks like:
>
>1) Read STREAM START message and figure out the target worker by xid.
>
>2) Put all changes, which belongs to this xact to the selected worker
>one by one via shm_mq_send.
>
>3) Read STREAM STOP message and wait until our worker will apply all
>changes in the queue.
>
>4) Process all other chunks of streamed xacts in the same manner.
>
>5) Process all non-streamed xacts immediately in the main apply worker
>loop.
>
>6) If we read STREAMED COMMIT/ABORT we again wait until selected worker
>either commits or aborts.
>
>Thus, it automatically guaranties the same commit order on replica as
>on master. Yes, we loose some performance here, since we don't apply
>transactions concurrently, but it would bring all those problems you
>have described.
>

OK, so it's apply in multiple processes, but at any moment only a single
apply process is active.

>However, you helped me to figure out another point I have forgotten.
>Although we ensure commit order automatically, the beginning of
>streamed xacts may reorder. It happens if some small xacts have been
>commited on master since the streamed one started, because we do not
>start streaming immediately, but only after logical_work_mem hit. I
>have performed some tests with conflicting xacts and it seems that
>it's not a problem, since locking mechanism in Postgres guarantees
>that if there would some deadlocks, they will happen earlier on
>master. So if some records hit the WAL, it is safe to apply the
>sequentially. Am I wrong?
>

I think you're right: the way you interleave the changes ensures you
can't introduce new deadlocks between transactions in this stream. I
don't think reordering the blocks of streamed transactions matters, as
long as the commit order is ensured in this case.

>Anyway, I'm going to double check the safety of this part later.
>

OK. FWIW my understanding is that the speedup comes mostly from
elimination of the serialization to a file. That however requires
savepoints to handle aborts of subtransactions - I'm pretty sure it'd be
trivial to create a workload where this will be much slower (with many
aborts of large subtransactions).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
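
To make the serial dispatch in steps 1) through 6) above a bit more
concrete, the following is a rough sketch of how the main apply worker could
route incoming messages; it illustrates the described scheme and is not code
from the patch. The message-kind letters, the StreamedXactWorker struct and
the pa_* helpers are assumed names; shm_mq_send() is the real shared-memory
queue call the steps refer to, and apply_dispatch() stands for the existing
non-streamed apply path in worker.c.

/*
 * Rough sketch of the serial dispatch described in steps 1)-6).
 */
#include "postgres.h"
#include "lib/stringinfo.h"
#include "storage/shm_mq.h"

typedef struct StreamedXactWorker
{
    TransactionId   xid;    /* toplevel xid this bgworker is applying */
    shm_mq_handle  *mq;     /* queue filled by the main apply worker */
} StreamedXactWorker;

/* Hypothetical helpers, standing in for patch internals. */
extern StreamedXactWorker *pa_find_or_launch_worker(TransactionId xid);
extern void pa_wait_for_worker(StreamedXactWorker *worker);
extern void apply_dispatch(StringInfo s);   /* normal, non-streamed apply */

/*
 * Called for every message arriving from the walsender; 'kind' is the
 * already-peeked message type and 's' the raw message body.
 */
static void
handle_streamed_message(char kind, TransactionId xid, StringInfo s)
{
    StreamedXactWorker *worker;

    switch (kind)
    {
        case 'S':               /* 1) STREAM START: pick worker by xid */
        case 'C':               /*    a change belonging to that xact  */
            worker = pa_find_or_launch_worker(xid);
            /* 2) forward the change, one message at a time */
            (void) shm_mq_send(worker->mq, s->len, s->data, false);
            break;

        case 'E':               /* 3) STREAM STOP: drain the queue */
            worker = pa_find_or_launch_worker(xid);
            pa_wait_for_worker(worker);
            break;

        case 'c':               /* 6) STREAM COMMIT/ABORT: forward, then */
        case 'A':               /*    wait until the worker finishes it  */
            worker = pa_find_or_launch_worker(xid);
            (void) shm_mq_send(worker->mq, s->len, s->data, false);
            pa_wait_for_worker(worker);
            break;

        default:                /* 5) everything else is applied here */
            apply_dispatch(s);
            break;
    }
}

Because the apply worker blocks in pa_wait_for_worker() at STREAM STOP and
again at STREAM COMMIT/ABORT, only one process is ever applying changes at a
time, and the commit order on the subscriber matches the order in which the
commits arrive.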